Announcing Sparkoin 0.1

Did you ever dream of being able to analyze the Bitcoin Blockchain using Big Data tools?

Something like this:

If this is your dream too, it is now a bit closer to reality than before. I am pleased to announce to the world the 0.1 release of the open source Sparkoin project.

Sparkoin is Big Data for Bitcoin. In practice, it is a system that lets you analyze the blockchain using Big Data technologies, most notably Hadoop and Spark.

What is Sparkoin

Sparkoin's home is on GitHub, here: http://github.com/sciabarra/Sparkoin

In its current incarnation, it is a Docker-based system: a set of Docker containers, including the scripts to build them.

Once you have built and started the containers, the blockchain will be downloaded and stored in Hadoop in an easily manageable JSON format.

You can then use the included notebook software to analyze those data. As a front end, there is a Jupyter notebook with a Toree Spark kernel.

A sample notebook sparkoin.ipynb is included.
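
To give an idea of what a notebook cell looks like, here is a minimal sketch. It assumes the Toree kernel exposes the usual sc SparkContext and uses the HDFS path scheme described in the Analyze section below; it simply peeks at the raw JSON of the first block.

// Minimal sketch: peek at the genesis block as stored in HDFS.
// Assumption: blocks live at hdfs://hadoop.loc/blochain/X.json (see "Analyze" below).
val (path, json) = sc.wholeTextFiles("hdfs://hadoop.loc/blochain/0.json").first()
println(s"read $path")
println(json.take(500))   // first 500 characters of the block JSON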

How to use

You need Docker: either the Docker Toolbox on Windows or Mac, or the native Docker for Windows/Mac (currently in private beta). I actually used both, including the latest Docker beta, since I got an invitation to test it.

You need at least version 1.10.x of Docker.

If you use a Linux box with standalone Docker, you also have to download the docker-compose tool.

Configuration

The first step is configuration. If you have the Toolbox, use the script configure-toolbox.sh; otherwise use configure.sh.

configure-toolbox.sh

The configuration script for the Toolbox requires you to choose an IP, usually 192.168.99.99.

The Toolbox configurator will take care of creating a virtual machine for you, with a fixed IP alias in your Toolbox virtual machine.

You can use any IP in the range 192.168.99.2 - 192.168.99.99. You need to use the subnet created by the Toolbox and avoid IPs that can be assigned to virtual machines by DHCP.

Since it will also create the virtual machine, it also needs to know the size of the virtual disk for your virtual machine. If you want to import the whole blockchain and store it in Hadoop, you need at least 200 GB.

Example:

sh configure-toolbox.sh 192.168.99.99 2000000

configure.sh

Otherwise, if you use a standalone Docker installation (either Linux or the Mac/Windows beta), you have to use the "real" IP of your machine, which you can read with ifconfig or ipconfig.

Example:

sh configure.sh 192.168.1.68

Build

After configuring, ensure you have a fast and unlimited internet connection and execute:

sh build.sh

It will download all the required software and create the Docker images. Please be patient, since it can take a long time (depending on how fast your internet connection is).

Before Running

To avoid filling your disk, the current configuration limits the download to the first 5000 blocks. If you want to download the whole blockchain, just remove the line with the variable BITCORE_STOP_AT from docker-compose.yml.

If you do so, ensure you have enough memory and disk space. The Docker Toolbox is definitely neither big nor fast enough to download the whole blockchain. You need a dedicated machine with native Docker and a lot of space (200 GB minimum).

Running

Finally, you can start:

sh start.sh

It will start your images in the background.
The Bitcoin server (wrapped in Bitcore) will start downloading the blockchain into Hadoop.

You can then access Jupyter at http://YOUR-IP:8888 to perform your analysis. YOUR-IP is the IP you chose during configuration: either the local VirtualBox IP (192.168.99.99) or the IP of your Docker machine if you have one.
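
Once the notebook is reachable, a quick way to check that the Toree kernel can actually talk to Spark is a trivial job like the following sketch (not part of the project, just a sanity check run in a notebook cell):

// If this prints a result, the notebook-to-Spark wiring works.
val sanity = sc.parallelize(1 to 1000).sum()
println(s"Spark is up, sum = $sanity")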

Analyze

As a starting point, a sparkoin.ipynb notebook is provided.

The whole blockchain is now stored in Hadoop, accessible from the notebook as hdfs://hadoop.loc/blochain/X.json, where X ranges from 0 to the latest block downloaded.
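
To make the path scheme concrete, here is a minimal sketch of loading blocks from the notebook. It assumes the usual sc SparkContext provided by the Toree kernel; wholeTextFiles returns one (path, content) pair per block file, regardless of how the JSON is formatted inside each file.

// All the blocks downloaded so far, one file per block:
val allBlocks = sc.wholeTextFiles("hdfs://hadoop.loc/blochain/*.json")
println(s"Blocks in HDFS: ${allBlocks.count()}")

// Or an explicit range, since the path argument accepts a comma-separated list:
val firstHundred = sc.wholeTextFiles(
  (0 to 99).map(n => s"hdfs://hadoop.loc/blochain/$n.json").mkString(","))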

A simple library to parse the blockchain JSON into Scala case classes for easier analysis is included. Check the example notebook for usage.
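
The bundled library has its own API, so check the notebook for the real thing; purely to illustrate the idea, here is a sketch of mapping block JSON onto a hypothetical case class with json4s (a JSON library Spark already depends on). The field names hash and height are assumptions about the block JSON, not the library's actual schema.

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical case class: the field names are assumptions; adjust them after
// inspecting the real block JSON (for example with the snippet above).
case class Block(hash: String, height: Long)

val blocks = sc.wholeTextFiles("hdfs://hadoop.loc/blochain/*.json")
  .map { case (_, json) =>
    implicit val formats: Formats = DefaultFormats   // declared inside the task to avoid serialization issues
    parse(json).extract[Block]
  }

// With case classes, analysis becomes ordinary Scala, e.g. the highest block downloaded:
val tip = blocks.map(_.height).max()
println(s"Latest block in HDFS: $tip")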

What is Next

This is just the first usable release, mostly meant for developing a library useful for analysis.

Note that it is really meant for further development and it does not scale (yet).

It starts only a single-node Hadoop, and Spark runs only in local mode through Jupyter. However, since it already gives you easy access to the blockchain, I decided to share it with the world in its current state.

Next steps will be to make it scalable by providing a Kubernetes deployment, to create a richer library for analysis, and to create a front end for displaying useful information.

Enjoy, please report bugs, and stay tuned for further development.