DIY block scanner (Blockchain explorer): a little theory

In the last article we looked at the reasons why we might need our own block explorer. That list is far from complete, but we will assume the decision is made: we need our own source of data about transactions and their connections with addresses.

Let's try to determine what we need for this. Obviously, we first need a copy of the desired blockchain, and this copy must remain current (synchronized with the corresponding network). The latter hints that we must either implement the network protocol ourselves (which will obviously be redundant and extremely expensive) or install existing node software, which is more rational. For Ethereum this will be, for example, Geth (go-ethereum); for Bitcoin, bitcoind or btcd (an implementation in Go); for other networks, the corresponding software. The main condition is access to the full blockchain (or the part that you want to track).

For clarity, let's recall how information is stored in the blockchain. Bitcoin and its descendants (let’s call them “classic”) use the UTXO (Unspent Transaction Output) concept. The result of a transaction is a set of outputs (“Outs”), and every input (“In”) of a transaction must refer to an output of an earlier transaction. The one exception is the coinbase transaction, which creates new coins as the block reward. Thus, each transaction contains a sending address and a recipient address in one form or another (we will not go into details now, the very fact of the presence of such information is sufficient). A side effect is that from this information we can also build a transaction tree that will allow us to track every satoshi that passes through the network.
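To make the In/Out relationship concrete, here is a minimal sketch of how sender addresses can be recovered by following each input back to the output it spends. The transaction shape used here (`txid`, `vin`, `vout`, `address`, `value`) is a simplified assumption for illustration and not the exact format returned by any particular node, and the two sample transactions are invented.

```javascript
// Simplified, hypothetical transaction shape (not a real node's output format):
//   { txid, vin: [{ txid, vout }] or [{ coinbase: true }], vout: [{ address, value }] }

// Build a lookup from "txid:index" to the output it identifies.
function buildOutputMap(transactions) {
  const outputs = new Map();
  for (const tx of transactions) {
    tx.vout.forEach((out, index) => {
      outputs.set(`${tx.txid}:${index}`, out);
    });
  }
  return outputs;
}

// List the addresses a transaction spends from and pays to.
function describeTransaction(tx, outputs) {
  const senders = [];
  for (const input of tx.vin) {
    if (input.coinbase) continue; // a coinbase input spends no previous output
    const prev = outputs.get(`${input.txid}:${input.vout}`);
    if (prev) senders.push({ address: prev.address, value: prev.value });
  }
  const recipients = tx.vout.map((out) => ({ address: out.address, value: out.value }));
  return { txid: tx.txid, senders, recipients };
}

// Invented sample data: a coinbase transaction and one that spends its output.
const chain = [
  { txid: "aa", vin: [{ coinbase: true }], vout: [{ address: "miner", value: 50 }] },
  {
    txid: "bb",
    vin: [{ txid: "aa", vout: 0 }],
    vout: [{ address: "alice", value: 30 }, { address: "miner", value: 20 }],
  },
];
const outputMap = buildOutputMap(chain);
const described = describeTransaction(chain[1], outputMap);
console.log(described);
```

Repeating the same lookup for each sender's own transaction is what lets us walk the tree backwards, output by output.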

The Ethereum network handles information a little differently. As I mentioned in the last article, the concept of Ethereum is different: the network stores address balances directly. However, information about transactions is still in blocks and stored on the network. Thus, indexing transactions and their relationship with addresses is still possible. In this article, I will deliberately not touch on smart contracts, operations with tokens (ERC20/ERC721/ERC1155) and the so-called “Internal Transactions”; this topic requires a separate article.

What is not available to us? Privacy-oriented blockchains such as Monero and Zcash, which hide senders, recipients and amounts by design (for example, with ring signatures or zero-knowledge proofs), cannot be indexed in this way.

After that, the process is straightforward and simple, although extremely demanding in terms of time and the amount of disk space you will need.

Step one: we request the initial block via RPC (the implementation depends on the specific network) and, going through its transactions one by one, decode them and select the information that we want to index. We will probably be interested in the transaction hash, the addresses used, the direction of the transaction, the time and the block number. The complete list depends on your tasks and on the resources available to you (the main bottleneck will be disk space).
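For Ethereum, this step can be sketched as follows. The request uses the standard `eth_getBlockByNumber` JSON-RPC method with the second parameter set to `true` to get full transaction objects. The row layout (`hash`, `address`, `direction`, `blockNumber`, `timestamp`) is only one possible choice, and the function names are made up for this example. The sample block reuses the transaction examined further below in this article, with a placeholder timestamp rather than the real one.

```javascript
// Build the JSON-RPC request body for one block, with full transaction objects.
function blockRequest(blockNumber, id = 1) {
  return {
    jsonrpc: "2.0",
    id,
    method: "eth_getBlockByNumber",
    params: ["0x" + blockNumber.toString(16), true],
  };
}

// Turn one decoded block into flat rows: one row per (transaction, address) pair.
function extractRows(block) {
  const blockNumber = parseInt(block.number, 16); // quantities arrive as hex strings
  const timestamp = parseInt(block.timestamp, 16);
  const rows = [];
  for (const tx of block.transactions) {
    rows.push({ hash: tx.hash, address: tx.from, direction: "out", blockNumber, timestamp });
    // `to` is null for contract creation, so there is no recipient row in that case.
    if (tx.to !== null && tx.to !== undefined) {
      rows.push({ hash: tx.hash, address: tx.to, direction: "in", blockNumber, timestamp });
    }
  }
  return rows;
}

// Sample block; the timestamp is a placeholder, not the real value.
const sampleBlock = {
  number: "0x1deb2",
  timestamp: "0x55d00000",
  transactions: [
    {
      hash: "0x7d7062d6f865931e0bbbccea46551a73d5d58a6ef618d5592c35b5256a65e9ba",
      from: "0x5685620dce626248ccb7121e87fbc098fd5310bd",
      to: "0x9b0a028eafdecde3afc0fd00b7937098388b7c8a",
    },
  ],
};
const request = blockRequest(122546);
const rows = extractRows(sampleBlock);
console.log(JSON.stringify(request));
console.log(rows);
```

Storing one row per address rather than one row per transaction makes the later "all transactions for address X" query a simple lookup, at the cost of roughly doubling the number of rows.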

Next, we save the information we are interested in to a database of some kind, take the next block and repeat the described steps. Not very elegant, but unfortunately there is no other recipe.
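The whole loop can be sketched in a few lines. Everything here is an assumption made for illustration: `AddressIndex` is an in-memory stand-in for a real database, `getBlock` is a hypothetical callback that would wrap the RPC call in practice (and would normally be asynchronous), and the three-block chain is invented test data.

```javascript
// In-memory stand-in for a database table keyed by address (assumption: a real
// indexer would write to a persistent store instead).
class AddressIndex {
  constructor() {
    this.byAddress = new Map();
    this.lastBlock = -1; // number of the last block that was fully processed
  }
  add(address, entry) {
    if (!this.byAddress.has(address)) this.byAddress.set(address, []);
    this.byAddress.get(address).push(entry);
  }
  lookup(address) {
    return this.byAddress.get(address) || [];
  }
}

// Walk blocks from `from` to `to` inclusive, skipping blocks already indexed.
// `getBlock` is a hypothetical callback returning { transactions: [{ hash, from, to }] }.
function indexRange(getBlock, index, from, to) {
  for (let n = Math.max(from, index.lastBlock + 1); n <= to; n++) {
    const block = getBlock(n);
    if (!block) break; // the node does not have this block (yet)
    for (const tx of block.transactions) {
      index.add(tx.from, { hash: tx.hash, direction: "out", block: n });
      if (tx.to) index.add(tx.to, { hash: tx.hash, direction: "in", block: n });
    }
    index.lastBlock = n; // checkpoint, so a later run can resume from here
  }
  return index.lastBlock;
}

// Invented three-block chain for demonstration.
const fakeChain = [
  { transactions: [] },
  { transactions: [{ hash: "0x01", from: "0xa", to: "0xb" }] },
  {
    transactions: [
      { hash: "0x02", from: "0xb", to: "0xc" },
      { hash: "0x03", from: "0xa", to: null },
    ],
  },
];
const index = new AddressIndex();
const last = indexRange((n) => fakeChain[n], index, 0, 10);
console.log(last, index.lookup("0xb"));
```

Keeping a checkpoint of the last processed block matters more than it may seem: the initial pass can take days, and it should survive a restart without starting over or writing duplicates.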

And a logical question arises: why don’t the blockchain developers themselves provide access to such obviously useful information? What prevents the node software from immediately saving and indexing such data when synchronizing with the network?

The answer will overlap with the explanation of why disk space will be a bottleneck. 

First, a small example. Those who have deployed an Ethereum node know that Geth offers several synchronization options. The “Sync modes” section of the documentation mentions Full nodes and Light nodes. The latter are not of interest to us, but Full nodes, in turn, are conditionally divided into Snap, Full and Archive. And this is where things get interesting, especially if you look closely at the illustration in the documentation that shows which of them is which. The Snap node stores detailed information about the last 128 blocks and a couple of checkpoints in the relatively recent past. The Full node, as follows from the illustration, stores checkpoints with a certain frequency almost until the “beginning of time.” The Archive node stores complete information for the entire existence of Ethereum.

If you set out to find how much data a Full node stores, it turns out to be about 1 TB (at the time of writing). We will not go into the mechanisms of “network reorganization” and other tricks. We are interested in something else: the amount of data stored by the Archive node is already about 16 TB(!). Does it turn out that more than 90% of the information about the blockchain (15 of those 16 TB) is not available to us?

If you run one simple experiment, a lot of things fall into place. 

Let's use the etherscan.io service and find a random transaction “from the beginning of time”, for example 0x7d7062d6f865931e0bbbccea46551a73d5d58a6ef618d5592c35b5256a65e9ba (from about August 2015), and try to get information about it using the console of a local Geth synchronized in full mode. We can get this information, right?


Welcome to the Geth JavaScript console!

instance: Geth/v1.12.0-stable-e501b3b0/linux-amd64/go1.20.3
at block: 19122581 (Tue Jan 30 2024 23:24:11 GMT+0000 (UTC))
 datadir: /home/eth/ethereum
 modules: admin:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0

To exit, press ctrl-d or type exit
> eth.getTransaction("0x7d7062d6f865931e0bbbccea46551a73d5d58a6ef618d5592c35b5256a65e9ba")

null

Um... So, there is no information about the transaction? Not really... Let's try to view the contents of the block in which the desired transaction is located, since we know the block number:

> eth.getBlockByNumber(122546)
{
...
   hash: "0xc58aa38cf7df6050d3be43f1557d61e3a28e3f34d7818b1644e6a9972003e80a",
...
   number: "0x1deb2",
...
   transactions: ["0x7d7062d6f865931e0bbbccea46551a73d5d58a6ef618d5592c35b5256a65e9ba"],
...
}


So here it is, in place! If we execute eth.getBlockByNumber(122546, true) (the additional parameter requests full information about the transactions in the block), we will receive the contents of the transaction and, for example, find out that the sender was the address 0x5685620dce626248ccb7121e87fbc098fd5310bd and the recipient 0x9b0a028eafdecde3afc0fd00b7937098388b7c8a, along with all related information. Why couldn’t we get this information by asking directly?
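The experiment above suggests a simple fallback, sketched here under stated assumptions: `node` is a hypothetical object whose two methods are modeled on the console calls used in this article, and `fakeNode` only imitates the behaviour we observed (a null answer by hash, a full answer by block number). It is not a real client library.

```javascript
// Look a transaction up by hash; if the node has no entry for that hash,
// fall back to scanning the block that is known to contain it.
function findTransaction(node, hash, blockNumber) {
  const direct = node.getTransaction(hash);
  if (direct) return { source: "index", tx: direct };
  const block = node.getBlock(blockNumber, true); // true = full transaction objects
  if (!block) return null;
  const wanted = hash.toLowerCase();
  const tx = block.transactions.find((t) => t.hash.toLowerCase() === wanted);
  return tx ? { source: "block-scan", tx } : null;
}

// Hypothetical stand-in for a node in full mode, imitating the session above.
const txHash = "0x7d7062d6f865931e0bbbccea46551a73d5d58a6ef618d5592c35b5256a65e9ba";
const fakeNode = {
  getTransaction: (hash) => null, // the lookup by hash finds nothing
  getBlock: (number, fullTransactions) =>
    number === 122546
      ? {
          number: "0x1deb2",
          transactions: [
            {
              hash: txHash,
              from: "0x5685620dce626248ccb7121e87fbc098fd5310bd",
              to: "0x9b0a028eafdecde3afc0fd00b7937098388b7c8a",
            },
          ],
        }
      : null,
};
const result = findTransaction(fakeNode, txHash, 122546);
console.log(result);
```

The catch is obvious: the fallback only works when we already know the block number, and supplying that mapping from hash to block is exactly the job of the index we are going to build.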

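Without going into technical details: looking a transaction up by hash requires a separate index from hash to block, and a node in full mode keeps that index only for a recent window of blocks (in this version of Geth the window is set by the --txlookuplimit option, which by default covers roughly the last year). The block bodies themselves are all still on disk, which is why reading the block directly works. The missing 15 terabytes of the archive node are mostly historical state, that is, balances and contract storage as of every past block, rather than this index. The lesson, however, is the same: every additional way of querying the chain is paid for in disk space.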

This gives us an understanding of what volumes of data we should prepare for. However, it is not at all necessary that your specific tasks require such detailed indexing; it is quite possible that your database will be somewhat more compact. 

As for Bitcoin, there are similar mechanisms, but the volume of data is incomparably smaller. At the time of writing, the Bitcoin blockchain takes up a modest 545 GB, so there will be significantly fewer problems with it.

In conclusion, a few words about how you can speed up the indexing process. It is not at all necessary to process every request through the client of the corresponding network. Most often, software for working with blockchains is open source, and database formats are quite standardized. It may be worth considering reading the on-disk data of an already synchronized node directly for the initial pass, and only then following new blocks through the client.

In the next article we will try to assemble a minimal implementation from available tools to prepare a search database for addresses for different blockchains.
