Bitcoin Address Clustering - Vincenzo Imperati

This project extracts information from the Bitcoin blockchain to create transaction graphs and perform chain analysis through Bitcoin address clustering. The main goal is to group Bitcoin addresses that likely belong to the same entity using various heuristic methods, enabling better understanding of Bitcoin transaction flows and entity behaviors.

Features

Blockchain Data Collection: Automated extraction of Bitcoin blockchain data from blocks 0 to 115,000
Transaction Graph Generation: Creates directed graphs with transactions as nodes and UTXOs as edges
Multiple Heuristics: Implements 10+ different clustering heuristics
Spark Integration: Uses Apache Spark for distributed data processing
Interactive Visualization: Web-based interface for exploring address clusters
Chain Analysis: Tools for analyzing transaction flows and entity movements
Comprehensive Results: Reduces the entity count by about 45% through clustering alone and by about 78% under the small-cluster assumption

Installation

Prerequisites

Python 3.7+
Apache Spark
Google Colab (for the provided notebook) or local Jupyter environment
Sufficient storage space for blockchain data

Dependencies

Install the required packages:

pip install pyspark
pip install PyDrive
pip install wget
pip install pyvis
pip install streamlit
pip install networkx
pip install matplotlib

Setup

Clone the repository:

git clone https://github.com/VincenzoImp/Bitcoin-Address-Clustering.git
cd Bitcoin-Address-Clustering

Configure Spark context with appropriate memory settings:

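from pyspark import SparkConf

# Memory settings sized for the full block range (0 to 115,000);
# lower them if the machine has less RAM available.
conf = SparkConf()\
    .set('spark.executor.memory', '50G')\
    .set('spark.driver.memory', '50G')\
    .set('spark.driver.maxResultSize', '50G')\
    .set('spark.driver.cores', '10')\
    .set('spark.sql.analyzer.maxIterations', '100000')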

Usage

Basic Usage

Set Global Constants:

start_block = 0
end_block = 115000

Download Dataset:

v_path, e_path, a_path, d_path = download_dataset(start_block, end_block, DATA_DIR, spark, True)

Generate Transaction Graph:

nx_graph = generate_nx_graph(v_df, e_df, graph_path, start_block, end_block, True)

Apply Clustering Algorithm:

clustered_addresses = address_clustering(nx_graph, a_df, known_tx_df, spark, start_block, debug=True)

Web Application

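Launch the interactive web interface, passing the start and end block as arguments (app.py lives in the app/ directory, so run the command from there):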

streamlit run app.py 0 115000

Project Structure

Bitcoin-Address-Clustering/
├── dataset/
│   └── blocks-0-115000/
│       ├── vertices-0-115000/
│       ├── edges-0-115000/
│       └── addresses-0-115000/
├── app/
│   ├── app.py
│   ├── Bitcoin.png
│   └── bitcoin-img.svg
├── Bitcoin_Address_Clustering.ipynb
└── README.md

Methodology

The project follows a systematic approach:

Data Collection: Extract transaction data from Bitcoin blockchain via Blockchain.info API
Graph Construction: Build directed graphs with transactions as nodes and UTXOs as edges
Heuristic Application: Apply multiple clustering heuristics in sequence
Result Analysis: Evaluate clustering effectiveness and entity reduction
Visualization: Generate interactive graphs for cluster exploration
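
The graph construction step can be illustrated with a small sketch. It is not the project's generate_nx_graph function: it uses plain dictionaries instead of NetworkX, and the transaction records and the build_tx_graph helper are hypothetical. It only shows the idea of transactions as nodes and spent outputs as directed edges.

```python
# Hypothetical toy data: each transaction lists the outputs it spends,
# as (funding_tx_id, output_index) pairs, and the outputs it creates,
# as (address, value_in_btc) pairs.
transactions = {
    "tx_a": {"spends": [], "outputs": [("addr1", 50.0)]},  # coinbase, no inputs
    "tx_b": {"spends": [("tx_a", 0)], "outputs": [("addr2", 30.0), ("addr1", 20.0)]},
    "tx_c": {"spends": [("tx_b", 0)], "outputs": [("addr3", 30.0)]},
}


def build_tx_graph(txs):
    """Return an adjacency dict: funding tx -> [(spending tx, address, value)]."""
    graph = {}
    for tx_id, tx in txs.items():
        graph.setdefault(tx_id, [])  # every transaction is a node
        for prev_tx, out_idx in tx["spends"]:
            address, value = txs[prev_tx]["outputs"][out_idx]
            # The spent output becomes a directed edge prev_tx -> tx_id.
            graph.setdefault(prev_tx, []).append((tx_id, address, value))
    return graph


graph = build_tx_graph(transactions)
for source, edges in graph.items():
    print(source, "->", edges)
```

Outputs that are never spent (such as the 20 BTC sent back to addr1 in tx_b) produce no edge in this sketch; how the project represents them is not specified here.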

Heuristics Implemented

Simple Heuristics

Satoshi Heuristic: Groups addresses from early coinbase transactions (blocks < 19,500) as likely belonging to Satoshi Nakamoto
Coinbase Transaction Mining Address Clustering: Assumes all output addresses from coinbase transactions belong to the same miner
Common-Input-Ownership: Groups all input addresses in multi-input transactions as belonging to the same entity
Single Input/Output: Treats single input, single output transactions as address movements within the same entity
Consolidation Transaction: Groups addresses in transactions with multiple inputs and single output
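
As an illustration of the common-input-ownership heuristic, the sketch below merges input addresses with a union-find structure. This is one common way to implement it, not necessarily the approach taken in the notebook; the UnionFind class, the cluster_by_common_inputs helper, and the sample transactions are assumptions made for the example.

```python
class UnionFind:
    """Minimal disjoint-set structure keyed by address strings."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_by_common_inputs(tx_inputs):
    """tx_inputs: list of input-address lists, one list per transaction."""
    uf = UnionFind()
    for inputs in tx_inputs:
        for addr in inputs:
            uf.find(addr)  # register every address, even single-input ones
        for addr in inputs[1:]:
            uf.union(inputs[0], addr)  # all inputs of one tx share an owner
    clusters = {}
    for addr in list(uf.parent):
        clusters.setdefault(uf.find(addr), set()).add(addr)
    return list(clusters.values())


# Hypothetical input lists: A and B co-spend, then B and C co-spend.
clusters = cluster_by_common_inputs([["A", "B"], ["B", "C"], ["D"], ["E", "F"]])
print(sorted(sorted(c) for c in clusters))
```

Because merging is transitive, A and C end up in the same entity even though they never appear in the same transaction.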

Advanced Heuristics

Payment Transaction Analysis: Identifies payment transactions with change addresses
Change Address Detection: Uses multiple sub-heuristics:
- Same address in input and output
- Address reuse patterns
- Unnecessary input analysis
- New address identification
- Round number detection
Mixed Transaction Recognition: Identifies and handles CoinJoin transactions using taint analysis
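
A rough sketch of how some of the change-address sub-heuristics could be combined for a two-output payment is shown below. The order of the checks, the 0.01 BTC rounding unit, and the guess_change_output function are assumptions for illustration; the notebook may apply these rules differently.

```python
def is_round(value_satoshi, unit=1_000_000):
    """Treat multiples of 0.01 BTC (1,000,000 satoshi) as round; assumed threshold."""
    return value_satoshi % unit == 0


def guess_change_output(input_addrs, outputs, seen_addrs):
    """Return the likely change address of a two-output transaction, or None.

    input_addrs: set of addresses spent by the transaction
    outputs:     list of (address, value_in_satoshi)
    seen_addrs:  set of addresses that appeared in earlier transactions
    """
    if len(outputs) != 2:
        return None
    # 1. Same address in input and output: funds returned to the sender.
    for addr, _ in outputs:
        if addr in input_addrs:
            return addr
    # 2. New address: exactly one output address has never been seen before.
    fresh = [addr for addr, _ in outputs if addr not in seen_addrs]
    if len(fresh) == 1:
        return fresh[0]
    # 3. Round number: the payment is round, so the non-round output is change.
    non_round = [addr for addr, value in outputs if not is_round(value)]
    if len(non_round) == 1:
        return non_round[0]
    return None  # ambiguous, do not cluster


outputs = [("B", 500_000_000), ("C", 123_456_789)]
print(guess_change_output({"A"}, outputs, seen_addrs={"A", "B"}))
```

Returning None when no rule gives a unique answer is a conservative choice that limits false positives.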

Results

Clustering Effectiveness

Initial Addresses: ~1,000,000 unique addresses
After Clustering: ~550,000 entities (45% reduction)
With Small Cluster Assumption: ~220,000 entities (78% reduction)
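
The percentages follow directly from the approximate counts above. A short check, using a hypothetical reduction_pct helper:

```python
def reduction_pct(initial, remaining):
    """Percentage by which the entity count drops relative to the address count."""
    return 100.0 * (initial - remaining) / initial


print(reduction_pct(1_000_000, 550_000))  # 45.0
print(reduction_pct(1_000_000, 220_000))  # 78.0
```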

Entity Distribution

The clustering reveals a power-law distribution of entity sizes:

Most entities contain 1-2 addresses
Few large entities contain hundreds of addresses
Largest clusters likely represent exchanges or major services
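
A distribution like this can be computed from an address-to-entity mapping. The sketch below assumes such a mapping is available as a plain dictionary; the sample data is made up and far too small to show a real power law.

```python
from collections import Counter

# Hypothetical clustering output: address -> entity id.
address_to_entity = {
    "a1": 0, "a2": 0, "a3": 0,
    "a4": 1,
    "a5": 2, "a6": 2,
    "a7": 3,
}

# Number of addresses per entity.
entity_sizes = Counter(address_to_entity.values())

# Number of entities per size, i.e. the histogram behind the distribution.
size_histogram = Counter(entity_sizes.values())

largest_entity, largest_size = entity_sizes.most_common(1)[0]
print(dict(size_histogram))
print(largest_entity, largest_size)
```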

Web Application

The project includes a Streamlit-based web interface that allows users to:

Input Bitcoin addresses for clustering analysis
Visualize transaction graphs with cluster highlighting
Explore entity relationships and transaction flows
Download clustering results and statistics

Features:

Interactive network visualization using PyVis
Real-time address clustering
Detailed transaction information
Export capabilities

Use Cases

1. Entity Movement Visualization

Track how funds move between addresses belonging to the same entity:

address = '115uADbwcLhfKeWJzy7EHjSWjn3dpHK1vZ'
cluster_graph = visualize_entity_movements(address)

2. Chain Analysis Queries

Answer specific questions about blockchain activity:

“How many unique miners were active before 2011?”
“What’s the largest entity by address count?”
“Which entities show mixing behavior?”
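
As an example, the first question could be answered along the following lines once clustering has run. The unique_miners_before function, the record layout, and the cutoff heights are assumptions for illustration; mapping "before 2011" to a concrete block height is left to the reader.

```python
def unique_miners_before(coinbase_outputs, address_to_entity, cutoff_height):
    """Count distinct mining entities with a coinbase output below cutoff_height.

    coinbase_outputs:  iterable of (block_height, output_address)
    address_to_entity: clustering result; unclustered addresses count as
                       their own single-address entity
    """
    miners = set()
    for height, addr in coinbase_outputs:
        if height < cutoff_height:
            miners.add(address_to_entity.get(addr, addr))
    return len(miners)


# Hypothetical data: m1 and m2 were clustered into the same entity e1.
coinbase_outputs = [(1, "m1"), (2, "m2"), (3, "m1"), (10, "m3"), (20, "m4")]
address_to_entity = {"m1": "e1", "m2": "e1", "m3": "e2"}

print(unique_miners_before(coinbase_outputs, address_to_entity, cutoff_height=15))
```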

3. Research Applications

Academic research on Bitcoin privacy
Compliance and AML investigations
Cryptocurrency forensics
Network analysis studies

Data Sources

Blockchain Data: Blockchain.info API
Block Range: Genesis block (0) to block 115,000
Time Period: January 2009 to February 2011
Transactions: ~400,000 transactions analyzed
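
For orientation, here is a minimal sketch of pulling one block from the Blockchain.info API and flattening its transactions. It is not the project's download_dataset function. The block-height endpoint and the JSON field names ("blocks", "tx", "inputs", "prev_out", "out", "addr", "value") are assumptions based on the public Blockchain Data API and should be checked against the current API documentation.

```python
import json
from urllib.request import urlopen

# Assumed endpoint; verify against the Blockchain.info API documentation.
BLOCK_HEIGHT_URL = "https://blockchain.info/block-height/{height}?format=json"


def fetch_block(height):
    """Download the JSON for one block height (performs a network call)."""
    with urlopen(BLOCK_HEIGHT_URL.format(height=height), timeout=30) as resp:
        return json.load(resp)


def extract_transactions(block_json):
    """Flatten a block response into rows of input addresses and outputs."""
    rows = []
    for block in block_json.get("blocks", []):
        for tx in block.get("tx", []):
            # Coinbase inputs have no previous output, so they are skipped.
            inputs = [
                i["prev_out"]["addr"]
                for i in tx.get("inputs", [])
                if i.get("prev_out") and "addr" in i["prev_out"]
            ]
            outputs = [(o["addr"], o["value"]) for o in tx.get("out", []) if "addr" in o]
            rows.append({"hash": tx["hash"], "inputs": inputs, "outputs": outputs})
    return rows


# Hand-written sample in the assumed response shape, so no network is needed.
sample = {"blocks": [{"tx": [
    {"hash": "h0", "inputs": [{}], "out": [{"addr": "miner", "value": 5000000000}]},
    {"hash": "h1", "inputs": [{"prev_out": {"addr": "miner", "value": 5000000000}}],
     "out": [{"addr": "bob", "value": 1000000000}, {"addr": "miner", "value": 4000000000}]},
]}]}
rows = extract_transactions(sample)
print(rows)
```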

Performance Considerations

Memory Requirements: 50GB+ RAM recommended for full dataset
Processing Time: Several hours for complete clustering
Storage: ~10GB for preprocessed datasets
Scalability: Designed for distributed processing with Spark

Limitations

Privacy Techniques: Advanced privacy methods (CoinJoin, mixers) can reduce clustering effectiveness
False Positives: Heuristics may incorrectly group unrelated addresses
Temporal Scope: Analysis limited to early Bitcoin history (2009-2011)
Data Availability: Depends on external API availability

Contributing

Contributions are welcome! Please feel free to submit pull requests, report bugs, or suggest new features.

Development Guidelines

Follow PEP 8 style guidelines
Add comprehensive docstrings
Include unit tests for new heuristics
Update documentation for new features

Future Enhancements

Extended Block Range: Support for more recent blockchain data
Advanced Clustering: Integration of machine learning approaches
Real-time Analysis: Live blockchain monitoring capabilities
Privacy Metrics: Quantitative privacy assessment tools

References

Satoshi Nakamoto’s Bitcoin Whitepaper
“A Fistful of Bitcoins” - Meiklejohn et al.
“An Analysis of Anonymity in the Bitcoin System” - Reid & Harrigan
Blockchain.info API Documentation

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

Bitcoin Core developers
Apache Spark community
Blockchain.info for API access
Academic research community for heuristic development

Note: This tool is intended for research and educational purposes. Users should comply with applicable laws and regulations when analyzing blockchain data.