On data indexing
Introduction
Most people don’t need to think about “indexing”. When you build an app, you decide how you structure your data. If I’m building Twitter/X, I have a list of all my users, and a list of all their tweets. These two lists tie together so I know who tweeted what, and can present that on my website / app.
Things get more complicated from there: I need to map usernames to display names, manage media like photos and videos, and more. But generally, I can set things up how I want, and I can use the right tool for each job:
- to store text and search it easily, I use one type of database
- to store images and media, I use another one
- for analytics, yet another one
What I described above is not how things work in crypto. The basis of the technology is that everyone is working off a shared database: “a blockchain” (whether that's Bitcoin, Ethereum, or Solana). This database is great at letting a lot of different people run it at the same time and all come to consensus on what it says, but it’s really bad at most other things. Blockchains have 3 core problems when it comes to data.
Problem
#1: Generic structure
Blockchains were not designed for app-friendly reads. They expose a handful of primitive data structures:
- blocks
- logs
- traces
- transactions
These basic structures work well for maintaining a ledger, but they are terrible for anything even one layer of abstraction away. If you’re building an NFT app, you don’t want “logs”, you want “NFT transfers.” If you’re building a decentralized exchange, you don’t want “traces”. You want “swaps,” “pools,” “positions,” and “24h volume.”
But the chain doesn’t give you these things. It gives you the raw ingredients, often encoded, and spread across multiple different places.
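To make the gap between “logs” and “NFT transfers” concrete, here is a minimal sketch in Python. It assumes an Ethereum-style log shaped like an entry from a JSON-RPC `eth_getLogs` response and the standard ERC-721 `Transfer` event, where all three arguments are indexed. The `NftTransfer` type, the helper names, and the sample log are hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# keccak256("Transfer(address,address,uint256)"): the first topic of every
# standard ERC-20 / ERC-721 Transfer event.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


@dataclass(frozen=True)
class NftTransfer:
    """The app-level shape we actually want (hypothetical type)."""
    contract: str
    sender: str
    recipient: str
    token_id: int


def topic_to_address(topic: str) -> str:
    """An indexed address is left-padded to 32 bytes; keep the last 20 bytes."""
    return "0x" + topic[-40:]


def decode_nft_transfer(log: dict) -> Optional[NftTransfer]:
    """Return a typed transfer, or None if the log is not an ERC-721 Transfer."""
    topics = log["topics"]
    # ERC-721 indexes all three arguments, so a matching log has four topics.
    if len(topics) != 4 or topics[0] != TRANSFER_TOPIC:
        return None
    return NftTransfer(
        contract=log["address"],
        sender=topic_to_address(topics[1]),
        recipient=topic_to_address(topics[2]),
        token_id=int(topics[3], 16),
    )


# Made-up raw log: this is roughly what the chain hands you.
raw_log = {
    "address": "0x" + "ab" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,  # from
        "0x" + "00" * 12 + "22" * 20,  # to
        "0x" + "00" * 31 + "2a",       # tokenId
    ],
    "data": "0x",
}

transfer = decode_nft_transfer(raw_log)
print(transfer)
```

Nothing in the raw log says “this is an NFT transfer”. That meaning comes from knowing the event signature ahead of time, which is exactly the kind of context an indexer has to bring with it.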
#2: Noisy
In a normal app, your database contains your app’s data. On a blockchain, your app’s activity lives in the same global history as everything else. Your five-user app shares the same underlying “book” as every trade that ever happened. If you want to answer a simple product question (“who owns these NFTs?”), you’re forced to sift through the entire world’s activity to find the tiny slice you care about. Imagine if Facebook had to filter out every Amazon product and order just to render your news feed.
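Here is a rough sketch of what answering “who owns these NFTs?” looks like against a shared history. It assumes the transfers have already been decoded and arrive in block order; the contract names, holders, and the `current_owners` helper are all made up for illustration.

```python
# Hypothetical, already-decoded transfers from the whole chain, in block order.
# Each tuple is (contract, sender, recipient, token_id).
ZERO = "0x0"  # stand-in for the zero address used when a token is minted
ALL_TRANSFERS = [
    ("0xOtherApp", ZERO, "alice", 1),
    ("0xMyNft", ZERO, "alice", 1),
    ("0xOtherApp", "alice", "bob", 1),
    ("0xMyNft", ZERO, "bob", 2),
    ("0xMyNft", "alice", "carol", 1),
]


def current_owners(transfers, contract):
    """Replay transfers for one contract; the last recipient of each token owns it."""
    owners = {}
    for log_contract, _sender, recipient, token_id in transfers:
        if log_contract != contract:
            continue  # noise: someone else's activity
        owners[token_id] = recipient
    return owners


owners = current_owners(ALL_TRANSFERS, "0xMyNft")
print(owners)  # {1: 'carol', 2: 'bob'}
```

In this toy list, two of the five entries are noise. On a real chain the ratio is far worse, and the loop has to walk the full history before it can answer anything.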
#3: Unoptimized
Even worse: the underlying databases used by blockchain nodes are primarily built to support the network’s write/verification needs (executing transactions, keeping state), not read workloads. The data is stored in a layout that is inefficient for reads, often keyed by hashes (fixed-size, one-way fingerprints that make the data verifiable but can’t be read back into the original value), and with key information required to decode it not sitting in the database at all.
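A toy illustration of why hashed storage is awkward to read from. This is a sketch under loud assumptions: SHA-256 stands in for whatever hash function a real chain uses, and `node_db`, `slot_key`, and `balance_of` are hypothetical names, not any node's actual storage format.

```python
import hashlib


def slot_key(holder: str) -> str:
    """Hash the holder's name into an opaque storage key.

    Assumption: SHA-256 is only a stand-in for a chain's real hash function.
    """
    return hashlib.sha256(holder.encode()).hexdigest()


# A toy node database: opaque key -> raw 32-byte value, and nothing else.
node_db = {
    slot_key("alice"): (150).to_bytes(32, "big"),
    slot_key("bob"): (75).to_bytes(32, "big"),
}


def balance_of(holder: str) -> int:
    """Point lookup: works only if you already know what to hash."""
    raw = node_db.get(slot_key(holder), bytes(32))
    # Knowing that these bytes are a big-endian integer is outside knowledge.
    # It is not stored in node_db, much like a contract's ABI is not on-chain.
    return int.from_bytes(raw, "big")


print(balance_of("alice"))  # 150

# The reverse question, "list every holder", cannot be answered from node_db:
# the keys are one-way hashes, so the names cannot be recovered from them.
```

A point lookup is fine if you already know the key. Questions like “list all holders” or “top 10 balances” are the ones that need a separate, read-friendly copy of the data.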
Solution
Indexing is the missing middle layer that solves these 3 problems, turning the encoded, noisy, and generic data from a blockchain into contextual and specific databases that you can actually build with. Indexers sit between:
- what you’re building (apps, dashboards, alerts)
- and how blockchain data actually exists (blocks/logs/traces/transactions)
Their job is to turn “raw chain exhaust” into “app-ready data”. In practice, that means:
- extracting only the events/state you care about
- decoding and normalizing it into readable fields
- organizing it into useful entities (e.g. swaps)
- storing it somewhere queryable
- keeping it updated as new transactions happen
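The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not how any particular indexer (Goldsky included) is implemented: the event shape, the `swaps` table, and the function names are hypothetical, SQLite stands in for “somewhere queryable”, and events are assumed to arrive in whole blocks.

```python
import sqlite3

# Hypothetical raw events as a node might hand them over: mixed apps, hex amounts.
RAW_EVENTS = [
    {"block": 100, "app": "dex", "kind": "swap", "pool": "ETH/USDC", "amount_hex": "0x64"},
    {"block": 100, "app": "nft", "kind": "transfer", "pool": None, "amount_hex": "0x01"},
    {"block": 101, "app": "dex", "kind": "swap", "pool": "ETH/USDC", "amount_hex": "0xc8"},
    {"block": 102, "app": "dex", "kind": "swap", "pool": "BTC/USDC", "amount_hex": "0x32"},
]


def open_store() -> sqlite3.Connection:
    """Store: an in-memory SQLite table shaped like the entity we care about."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE swaps (block INTEGER, pool TEXT, amount INTEGER)")
    return db


def index_events(db, events, from_block):
    """Index swaps at or after from_block and return the next block to resume from."""
    next_block = from_block
    for event in events:
        if event["block"] < from_block:
            continue  # already indexed on an earlier run
        next_block = max(next_block, event["block"] + 1)
        if event["app"] != "dex" or event["kind"] != "swap":
            continue  # extract: skip everything we do not care about
        amount = int(event["amount_hex"], 16)  # decode: hex string -> integer
        db.execute(
            "INSERT INTO swaps VALUES (?, ?, ?)",  # organize + store as a swap row
            (event["block"], event["pool"], amount),
        )
    db.commit()
    return next_block


db = open_store()
next_block = index_events(db, RAW_EVENTS[:2], from_block=0)       # initial backfill
next_block = index_events(db, RAW_EVENTS, from_block=next_block)  # later: only new blocks

# Query: per-pool volume, the kind of question the raw chain cannot answer directly.
volume_by_pool = dict(
    db.execute("SELECT pool, SUM(amount) FROM swaps GROUP BY pool ORDER BY pool")
)
print(volume_by_pool)  # {'BTC/USDC': 50, 'ETH/USDC': 300}
```

Returning `next_block` is the simplest version of “keeping it updated”: each run picks up where the last one stopped, so the store never has to be rebuilt from the start of the chain.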
If you like metaphors: blockchains are like a giant book, written in hieroglyphics by many authors at once, all writing unrelated sentences one after the other. Indexers translate those hieroglyphics into English, and pull out only the sentences you care about. They add in an index, table of contents, clean chapters, and even summaries and statistics for quantifiable components. They do this following the exact instructions you give them.
Conclusion
Blockchains are not application databases. They’re global, write-optimized ledgers with a minimal, awkward read interface. Indexers exist because the raw chain is not a usable application database, and people building apps need:
- structured, domain-shaped data
- fast and flexible queries
- real-time updates
- reliability at scale
There are many different approaches to indexing. Goldsky is one of them (in my unbiased opinion, the best), but there are others. Going into the options is probably a job for a separate follow-up post, and why this effort is worth the trouble is a whole separate question as well. But for now, I hope this gives you a clearer picture of what indexing is and why it matters in the blockchain space.