My rabbit hole into consistent hashing began with the paper 'Scaling Memcached at Facebook'. One topic led to another, and here I am discussing BitTorrent. Consistent Hashing is a captivating subject, diving into it will make you fall in love with software engineering. It has numerous applications: Akamai uses it to manage Internet traffic, Amazon incorporates it into AWS features like Dynamo, and it's essential in Peer-to-Peer networking. But for now, let's focus on BitTorrent.
We've all used BitTorrent, right? And of course, not for pirating 😉. The inner workings of BitTorrent are incredibly cool - it's truly a marvel of software engineering. In this blog, we'll take a bird's-eye view of BitTorrent's workings and architecture, focusing on how BitTorrent utilizes Distributed Hash Tables and Consistent Hashing. So, let's dive right in!
Peer-to-Peer networking
Conceptually, Peer-to-Peer computing offers an alternative to the traditional client/server architecture. In the conventional model, there is one server and multiple clients. However, in Peer-to-Peer networking, participants share a portion of their resources for file sharing.
Introduction to BitTorrent
BitTorrent is a Peer-to-Peer protocol/technology that makes sharing a file among users easier and more efficient. This is done by utilizing the upload capacity of the peers downloading the file.
If you've used a torrent client, you've likely encountered terms like "seeds", "peers", and "leeches". To better understand how torrenting works, let's define these terms.:
Torrent: The metaInfo(Torrent) file contains essential information like hashing information, size, filename, URL of the 'tracker', etc.
Seed: if a user has downloaded the whole file, and is now sharing the file with others, that user is called a seed.
Leecher (or Peers): A downloading user with nothing or parts of the file, is called a leecher.
Tracker: The tracker maintains a record of peers currently downloading a file and assists them in connecting with each other. We'll explore how trackers utilize consistent hashing to locate peers later in this blog.
The Architecture
Here's the process: The publisher (the individual sharing the file) first creates a .torrent file. This torrent includes metadata about the file being shared, such as the filename, size, hashing information, and the tracker's URL. I opened a .torrent file using an online torrent editor, and here's a sample data of what it stores:
As mentioned earlier, a tracker records the peers currently downloading the file. It uses a straightforward protocol over HTTP to enable communication between the tracker and the users. At first, the user sends the tracker details about the file being downloaded and the ports they're listening on. The tracker then responds with a list of other users downloading the same file, along with contact information for those users.
But there's a catch, at least one downloader must have the complete file before the original publisher can stop seeding. Once the seed has shared the entire file, it can stop uploading, and the download will continue for other leechers.
I found this cool video on YouTube that visualizes how files are exchanged between leechers and seeders and trackers: The Power of BitTorrent visualized - Bittorrent in 2023 isn't dead (youtube.com)
Algorithms
BitTorrent follows various algorithms, such as the piece selection algorithm and resource allocation algorithms. However, we won't be covering those in this blog. I'll provide references for these algorithms at the end if you're interested in exploring them further. Instead, we'll focus on Distributed Hash Tables and Consistent Hashing. So, let's dive right in.
There was a problem with the original BitTorrent implementation. What happens if the tracker goes offline? None of the peers would be able to download the file, which means we're too dependent on a single server.
The Solution:Using multiple trackers for the same torrent. This way, leechers can connect through different trackers, ensuring that file transfers can continue even if one tracker fails.
However, this solution introduces another challenge: the cost of servers. Hosting trackers for millions of users daily can be quite expensive. A study from 2013 indicated that BitTorrent had between 15 to 27 million daily users. Now, imagine hosting multiple trackers for that many users.
In May 2005, version 4.1 was released by Cohen (BitTorrent creator), which solved this issue:
Decentralized tracker
A decentralized tracker is a protocol that utilizes multiple trackers instead of just one, as mentioned earlier. In this system, rather than hosting trackers on a server, the clients themselves function as mini-trackers. This means that if a torrent has 1,000 clients, there are 1,000 trackers, with each client acting as a tracker.
Smart Right? This solution is based on Distributed Hash Tables(DHTs).
Distributed Hash Tables
DHTs, or Distributed Hash Tables, are essentially hash tables distributed across multiple nodes. They work by finding values based on key values, with the main difference being that storage and lookup processes are spread across the nodes.
DHTs focus on 3 main characteristics:
Decentralization
Scalability
Fault Tolerance
While decentralization and scalability are straightforward concepts, one important characteristic to discuss is fault tolerance. In DHTs, where multiple nodes store and manage key/value pairs, the system must handle the departure of any single node without requiring a complete data reorganization. This is where consistent hashing plays a crucial role.
Keyspace Partitioning
I recommend reading one of my other blogs, The Art of Consistent Hashing, to know more about consistent hashing. In DHTs, each node in the network is given a unique identifier consisting of m bits. To map keys to these nodes, we use consistent hashing. Picture the nodes arranged in a circle. To assign a key to a node, consistent hashing hashes the key and places it on the circle. We then move clockwise around the circle until we find the appropriate node.
When a new node joins the circle, it takes over the keys that were previously assigned to the node immediately following it. Conversely, if a node leaves the network, its keys are reassigned to the node that follows it. The important point here is that, on average, only n/m
keys need to be reassigned, where n
is the total number of keys and m
is the number of nodes.
In BitTorrent, when you receive a torrent, you need to find out which node has the tracker for that file using DHT. Since trackers are distributed among clients, each client is assigned a unique hash key. Imagine these clients arranged in a circle. To locate the node with the tracker for a specific file, we hash the torrent ID using consistent hashing and place it on the circle. Then, we move clockwise around the circle until we reach the successor node that holds the tracker.
Thank you for reading, and I hope this blog has shed some light on the fascinating world of distributed hash tables and their role in peer-to-peer networks like BitTorrent. Feel free to leave any questions or comments, and stay tuned for more discussions on distributed systems and algorithms.