Consistent Hashing: A Practical Example for Enhanced Distributed Systems in Java

Consistent hashing is a widely used technique that addresses common challenges in distributed systems, such as uneven data distribution and load balancing across multiple nodes. In this blog post, I will present a practical example to illustrate the concepts of consistent hashing and virtual nodes, and discuss their applications in various domains.

Consistent Hashing: A Practical Example

Imagine a distributed caching system with three nodes, A, B, and C. We want to distribute the cache data across these nodes evenly and ensure minimal data movement when nodes are added or removed. To achieve this, we use consistent hashing.

In our example, let's assume we have four data items with keys k1, k2, k3, and k4. We use a hash function to map both the keys and the nodes onto a fixed-size hash ring. Java's built-in hashCode() method is the simplest option, but String.hashCode() gives similar strings nearby values (for example, "k1" and "k2" differ by one), so a better-mixing algorithm like MurmurHash is usually the safer choice.
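As a minimal sketch of this mapping step, the class below turns any string into a ring position. MurmurHash is not part of the JDK, so this sketch substitutes MD5 from java.security.MessageDigest purely as a readily available, well-mixing hash; the class name RingHash and the method position are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RingHash {
    /** Maps a key to a position in [0, ringSize) using the first 4 bytes of its MD5 digest. */
    static int position(String key, int ringSize) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            int hash = ByteBuffer.wrap(digest).getInt(); // may be negative
            return Math.floorMod(hash, ringSize);        // always in [0, ringSize)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String[] names = {"Node A", "Node B", "Node C", "k1", "k2", "k3", "k4"};
        for (String name : names) {
            System.out.println(name + " -> " + position(name, 1000));
        }
    }
}
```

The positions this produces will not match the hand-picked values used in the rest of the example; those were chosen to keep the arithmetic easy to follow.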

For simplicity, let's say our hash ring's range is [0, 999]. We map the keys and nodes to the following positions:

Node A: 100
Node B: 400
Node C: 700
k1: 150
k2: 350
k3: 600
k4: 900

To determine the responsible node for each key, we follow the hash ring clockwise from the key's position until we find a node. In our example:

k1 (150) is stored on Node B (400).
k2 (350) is stored on Node B (400).
k3 (600) is stored on Node C (700).
k4 (900) is stored on Node A (100), as the ring wraps around.

Now, if we add a new node (Node D at position 250) or remove an existing one (e.g., Node B), only a small fraction of the data needs to be redistributed. For instance, when adding Node D, only k1 would move, from Node B to Node D, because D is now the first node clockwise from 150. If Node B were removed instead, only the keys B held (k1 and k2) would move, to the next node clockwise, Node C.
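The clockwise lookup can be sketched with a sorted map: the owner of a key is the first node at or after the key's position, wrapping around to the lowest position if there is none. The class below is a minimal illustration that takes ring positions directly, using the hand-picked numbers from this example; SimpleRing and its method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class SimpleRing {
    // Ring position -> node name, kept sorted so we can search clockwise.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String name, int position) {
        ring.put(position, name);
    }

    void removeNode(int position) {
        ring.remove(position);
    }

    /** Returns the first node at or after keyPosition, wrapping to the lowest position. */
    String nodeFor(int keyPosition) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Integer, String> entry = ring.ceilingEntry(keyPosition);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        SimpleRing ring = new SimpleRing();
        ring.addNode("A", 100);
        ring.addNode("B", 400);
        ring.addNode("C", 700);

        int[] keys = {150, 350, 600, 900}; // positions of k1..k4
        for (int k : keys) {
            System.out.println(k + " -> " + ring.nodeFor(k));
        }

        ring.addNode("D", 250); // only keys in (100, 250] change owner
        for (int k : keys) {
            System.out.println(k + " -> " + ring.nodeFor(k));
        }
    }
}
```

A TreeMap keeps lookups at O(log n) in the number of ring positions, which matters once each node occupies many positions.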

Applications of Consistent Hashing

Consistent hashing is widely used in various domains, including but not limited to:

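Distributed caching systems, like Memcached, where client libraries use it (for example, the ketama scheme) to spread keys across servers and make cache lookups efficient. Redis Cluster uses a fixed set of 16384 hash slots rather than a hash ring, but client-side sharding across Redis instances can use consistent hashing.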
Distributed databases and NoSQL databases, such as Cassandra, where it helps distribute data evenly and simplifies horizontal scaling.
Load balancing in web services and content delivery networks (CDNs), where it routes requests for the same key to the same server and limits how many requests are remapped when servers are added or removed.
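Distributed storage systems, like OpenStack Swift, where it helps balance the storage load across multiple nodes. Note that the Hadoop Distributed File System (HDFS) does not work this way: its NameNode tracks block locations centrally.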

Improving Consistent Hashing with Virtual Nodes: A Practical Example

With only a few nodes, consistent hashing can still result in imbalanced data distribution, because the nodes' positions may be unevenly spaced around the ring. Virtual nodes improve this situation by placing each physical node on the ring at several positions. In our example above, let's assume Node A is assigned three virtual nodes: A1, A2, and A3. Each virtual node is assigned its own position on the hash ring by hashing a combination of the physical node's identifier and the virtual node's index.

Let's say the ring now has the following positions:

A1: 100
A2: 300
A3: 500
B: 400
C: 700

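Now, following the ring clockwise as before, our data distribution is as follows:

k1 (150) is stored on A2 (300).
k2 (350) is stored on Node B (400).
k3 (600) is stored on Node C (700).
k4 (900) is stored on A1 (100), as the ring wraps around.

Node A now holds k1 and k4 through two different virtual nodes. Because only Node A has virtual nodes in this small example, it also owns a larger share of the ring than B or C. In practice, every physical node is given many virtual nodes, so each node's share is made up of many small arcs scattered around the ring, which results in a more balanced data distribution across the nodes.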
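A rough way to check how balanced a ring is would be to add up the arc of the ring that each physical node owns. The sketch below does this for a table of positions like the one above, where several positions may belong to the same physical node; RingShare and share are hypothetical names, and the code assumes a non-empty ring with positions in [0, ringSize).

```java
import java.util.Map;
import java.util.TreeMap;

class RingShare {
    /** Sums, per physical node, the arc length owned by its ring positions. */
    static Map<String, Integer> share(TreeMap<Integer, String> ring, int ringSize) {
        Map<String, Integer> owned = new TreeMap<>();
        // Predecessor of the lowest position is the highest one, wrapped around.
        int previous = ring.lastKey() - ringSize;
        for (Map.Entry<Integer, String> e : ring.entrySet()) {
            // Under clockwise lookup a position owns the arc (previous, position].
            owned.merge(e.getValue(), e.getKey() - previous, Integer::sum);
            previous = e.getKey();
        }
        return owned;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> ring = new TreeMap<>();
        ring.put(100, "A"); // A1
        ring.put(300, "A"); // A2
        ring.put(500, "A"); // A3
        ring.put(400, "B");
        ring.put(700, "C");
        System.out.println(share(ring, 1000));
    }
}
```

Giving B and C their own virtual nodes and rerunning is a quick way to see how the shares shift.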

Advantages of Virtual Nodes

Improved load balancing: Virtual nodes allow for better distribution of data and load across the nodes in the system, reducing the likelihood of hotspots.
Smoother node addition and removal: When a node is removed, the keys held by its virtual nodes are spread across many other nodes instead of all landing on a single successor, and a new node likewise takes a small share from many existing nodes, minimizing the impact on any one part of the system.
Better handling of heterogeneous nodes: In cases where nodes have varying capacities, you can assign a different number of virtual nodes to each physical node to accommodate their different capabilities.
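The points above can be sketched as a ring in which each physical node is hashed onto the ring several times, with the number of virtual nodes acting as a weight. This is a minimal illustration under stated assumptions, not a production design: VirtualNodeRing and its methods are hypothetical names, MD5 stands in for whichever hash function is actually used, and the "node#index" naming of virtual nodes is one possible convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class VirtualNodeRing {
    // Ring position -> physical node; one entry per virtual node.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Places `replicas` virtual nodes for `node`, hashing "node#index" for each one. */
    void addNode(String node, int replicas) {
        for (int i = 0; i < replicas; i++) {
            // A 64-bit collision would overwrite an entry; ignored in this sketch.
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Removes every virtual node that belongs to `node`. */
    void removeNode(String node) {
        ring.values().removeIf(node::equals);
    }

    /** Returns the physical node owning the first position at or after the key's hash. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(d).getLong(); // first 8 bytes of the digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing();
        ring.addNode("A", 100);
        ring.addNode("B", 100);
        ring.addNode("C", 200); // assumed to have roughly twice the capacity

        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < 10_000; i++) {
            counts.merge(ring.nodeFor("key" + i), 1, Integer::sum);
        }
        System.out.println(counts); // expect C to receive the largest share
    }
}
```

Removing a node in this sketch only changes the owner of keys that node held; every other key keeps its first position clockwise and therefore stays where it is.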

Conclusion

Consistent hashing is a powerful technique for distributing data evenly in distributed systems and minimizing data movement when nodes are added or removed. It is used in various domains, such as distributed caching systems, databases, load balancing, and distributed file systems. Virtual nodes can further enhance consistent hashing by improving load balancing and facilitating the handling of heterogeneous nodes.