It’s 2024 and you’d think getting crypto data is easy: you have Etherscan, Dune and Nansen letting you see whatever data you want, whenever you want. Well, kind of.
You see, in normal web2 land, when you have a company with 10 employees and 100,000 customers, the amount of data you’re producing is probably no more than hundreds of gigabytes (on the upper end). That scale of data is small enough that your iPhone could store everything and crunch any question you have. However, once you have 1,000 employees and 100,000,000 customers, the amount of data you’re dealing with is probably in the hundreds of terabytes, if not petabytes.
This is fundamentally a different challenge, since the scale you’re dealing with demands far more consideration. To process hundreds of terabytes of data, you need a distributed cluster of computers to send jobs to. When sending these jobs you have to think about:
- What happens if a worker fails to do its job
- What happens if one worker takes a lot longer than the others
- How you figure out which job to give to which worker
- How you combine all of their results and ensure the computation was done correctly (a rough sketch of these concerns follows this list)
These are all considerations you need to deal with when running big-data compute across multiple machines. Scale breeds issues that are invisible to those who don’t work with it, and data is one of those domains where the more you scale up, the more infrastructure you need to manage it correctly. To handle this scale you also face additional challenges:
- Extremely specialised talent that knows how to operate machines at this scale
- The cost to store and compute over all the data
- Forward planning and architecture to ensure your needs can be supported
It’s funny: in web2 everyone wanted the data to be public. In web3, it finally is, but very few know how to do the work needed to make sense of it. One deceptive thing here is that, with some assistance, you can pull your own slice out of the global data set fairly easily. “Local” data is easy; “global” data (things that pertain to everyone and everything) is hard to get.
As if the scale alone weren’t challenging enough, there’s another dimension that makes crypto data hard: continuous fragmentation, driven by the financial incentives of the market. For example:
- Rise of new blockchains. There are close to 50 L2s live, another 50 known to be upcoming and hundreds more in the pipeline. Each L2 is effectively a new database source that needs to be indexed and configured. Hopefully they’re standardised, but you can’t always be sure!
- Rise of new virtual machines. EVM is just one domain; SVM, Move VM and countless others are coming to market. Each new type of virtual machine means an entirely new data schema that has to be reasoned about from first principles with deep understanding. How many VMs will there be? Well, investors will incentivise a new one to the tune of billions of dollars!
- Rise of new account primitives. Smart contract wallets, hosted wallets and account abstraction throw a new complication into how you actually interpret the data. The from address may not be the real user, because the transaction was submitted by a relayer and the real user is somewhere in the mix (if you look hard enough). A sketch of what this looks like follows this list.
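To make the account-abstraction point concrete: in an ERC-4337 flow, the transaction’s from is the bundler that submitted it, while the user who actually acted is the sender recorded in the EntryPoint’s UserOperationEvent log. Below is a rough sketch of pulling that apart with web3.py (v6+ style names); the RPC URL and transaction hash are placeholders, and the event signature is the one I believe the v0.6 EntryPoint emits.

```python
# Hypothetical sketch: recover the "real" users behind an ERC-4337 bundled tx.
# The RPC URL and transaction hash are placeholders, not real values.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://YOUR_RPC_ENDPOINT"))

# keccak256 of the ERC-4337 UserOperationEvent signature (v0.6 EntryPoint).
USER_OP_EVENT_TOPIC = Web3.keccak(
    text="UserOperationEvent(bytes32,address,address,uint256,bool,uint256,uint256)"
).hex()

def real_senders(tx_hash: str) -> list[str]:
    tx = w3.eth.get_transaction(tx_hash)
    receipt = w3.eth.get_transaction_receipt(tx_hash)
    print("tx.from (the bundler/relayer):", tx["from"])

    senders = []
    for log in receipt["logs"]:
        topics = [t.hex() for t in log["topics"]]
        if topics and topics[0] == USER_OP_EVENT_TOPIC:
            # topics[2] is the indexed `sender`: the smart account that acted.
            senders.append(Web3.to_checksum_address("0x" + topics[2][-40:]))
    return senders

# Example usage (placeholder hash):
# print(real_senders("0x..."))
```

If you index only tx.from, every user of a given bundler collapses into one “user”; the sketch above is the kind of unpacking you end up doing per account primitive.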
Fragmentation is particularly challenging because you can’t quantify what you don’t know. You will never know every L2 that exists or every virtual machine that will come to market. You can keep up once they reach enough scale, but that’s a story for another time.
This last one, I think, catches a lot of people by surprise: yes, the data is open, but no, it is not easily interoperable. You see, every set of smart contracts a team pieces together is like a little database inside a larger database. I like to think of them as schemas. All the data is there, but how you piece it together is usually understood only by the team that developed the smart contracts. You can spend the time to understand it yourself if you’d like, but you’ll have to do it hundreds of times across all the potential schemas. And how are you going to afford that without burning through large sums of money, with no buyer on the other side of the transaction?
In case this feels too abstract, let me provide an example. Say you ask: “How much does this user utilise bridges?” Although that presents as one question, it has many nested problems inside it. Let’s break it down:
- You first need to know all the bridges that exist, across all the chains you care about. If that’s every chain, well, we’ve already covered why that’s challenging.
- Then, for each bridge, you need to understand how its smart contracts work.
- Once you’ve understood all the permutations, you need to reason through a model that can unify all these individual schemas.
Each of these steps is difficult to figure out and highly resource intensive.
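To give a flavour of that unification step, here’s a hypothetical sketch of what a unified “bridge transfer” model might look like: one adapter per bridge that knows that protocol’s event shape, all normalised into a single schema before you can even ask the original question. The bridges, field names and raw event shapes below are entirely invented for illustration.

```python
# Hypothetical sketch of unifying per-bridge schemas into one model.
# "bridge_a" and "bridge_b", their field names and event shapes are made up.
from dataclasses import dataclass

@dataclass
class BridgeTransfer:
    """The unified schema every bridge-specific adapter must map into."""
    bridge: str
    user: str
    src_chain: str
    dst_chain: str
    token: str
    amount_raw: int

def parse_bridge_a(event: dict) -> BridgeTransfer:
    # "Bridge A" emits flat sender / destinationChainId / token / amount fields.
    return BridgeTransfer(
        bridge="bridge_a",
        user=event["sender"],
        src_chain=event["chain"],
        dst_chain=str(event["destinationChainId"]),
        token=event["token"],
        amount_raw=int(event["amount"]),
    )

def parse_bridge_b(event: dict) -> BridgeTransfer:
    # "Bridge B" nests the same facts under completely different names.
    return BridgeTransfer(
        bridge="bridge_b",
        user=event["params"]["recipient"],
        src_chain=event["sourceChain"],
        dst_chain=event["targetChain"],
        token=event["params"]["asset"],
        amount_raw=int(event["params"]["value"]),
    )

ADAPTERS = {"bridge_a": parse_bridge_a, "bridge_b": parse_bridge_b}

def bridge_volume_by_user(raw_events: list[tuple[str, dict]]) -> dict[str, int]:
    """Answer 'how much does this user use bridges?' over the unified model."""
    totals: dict[str, int] = {}
    for bridge_name, event in raw_events:
        transfer = ADAPTERS[bridge_name](event)
        totals[transfer.user] = totals.get(transfer.user, 0) + transfer.amount_raw
    return totals
```

The code itself is the easy part; the expensive part is discovering that each bridge exists in the first place and reverse-engineering what its fields actually mean, hundreds of times over.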
So what does this all lead to? Well, the state of the ecosystem we have today, where…
- No one actually knows what’s truly happening. There’s just a hand-wavey notion of activity that is hard to properly quantify.
- User counts are inflated and sybils are hard to detect. Metrics become irrelevant and untrustworthy! What’s real or fake doesn’t even matter to market participants, because it all looks the same.
- On-chain identity struggles to become real. If you want a strong sense of identity, accurate data is critical; otherwise your identity is being misrepresented!
I hope this article has helped open your eyes to the realities of the data landscape in crypto. If you are facing any of these issues or want to learn how to overcome them, reach out: my team and I are tackling them.
Read More: kermankohli.substack.com