Mom !! Where is my Black T-shirt?
Have you ever faced a situation where you can't find your things, but when you ask your mom she magically finds them for you? Well, she may be using a consistent hashing technique to search for your things :D ( my mom is a gold medalist :D ). Let's discuss hashing and why it is important for mission-critical applications.
When you think about it deeply, no matter how many fancy technologies you use (Kafka, a Redis cache, NoSQL, MySQL, a CDN, etc.), at the end of the day all they do is store data and retrieve it. So simple, right!!! Thankfully it is not so simple, else I would be out of a job. As the world evolves very fast (hello, aliens!!! ;)), the need to store more and more data keeps increasing. But just storing doesn't suffice; you also need to fetch the saved data very fast. There are quite a few fancy techniques that help with searching. Hashing is one such beautiful technique, which helps us search in a very small amount of time; in computer science terminology we call it O(1) time ( **showing off** ).
What does hashing mean ?
Hashing is a technique in which we convert any input to a fixed range of outputs using a hashing function. A hash uniquely identifies a specific object from a group of similar objects. We use hashing unknowingly in real life as well: for example, a passport number can be used to get information about any citizen of a country, an ISBN can be used to get information about a book, to get a spoon we always go to the kitchen, etc.
As in the example, each citizen's information (first name, last name, address, criminal history, date of birth) is mapped to a passport number by the Ministry of External Affairs. Whenever the government is in need, it can fetch the information using the passport number.
What are good hashing functions ?
There are a gazillion ways in which our government could have assigned passport numbers to its citizens:
- By the first letter of the name. The key space is 26, since that is how many English letters there are (unless you come up with a new alphabet)
- By the state to which the citizen belongs
But do you see any obvious issues?
- If we use the first letter of the name as a hashing function, the distribution will not be uniform: for example, there will be a lot of people with names starting with ‘S’ but very few people with names starting with ‘I’
- In either of the above approaches there will be a lot of collisions, i.e. several people will fall into the same bucket. Consider this hypothetical scenario: if the government had used the state as the identifier, then to find Avinash from Andhra Pradesh they would need to search among 49.67 million people.
So good hash functions are the ones which are
- Easy to compute. I should not have to solve a differential equation to identify a person
- Uniformly distributing, i.e. people should be spread across the key space equally
- Collision-minimizing. A collision occurs when two different inputs (e.g. two people whose names start with the same letter) map to the same bucket
Okay, are we done? No.
The above approaches are not horizontally scalable. What does that mean?
Let's suppose the government had used the state as the identifier. We all know that Telangana was carved out as a new state. When this happened, everything had to be migrated: the state in addresses had to be changed from Andhra Pradesh to Telangana, ‘AP’ on number plates had to be replaced with ‘TS’, and so on. But we still see vehicles with ‘AP’ number plates, and sometimes I still write Andhra Pradesh as the state in my address. It will probably take some years to get used to, and that is what we mean when we say a scheme is not horizontally scalable.
So let's jump into computer science.
Suppose I'm starting a new school with 12 students. As in every school, each student is assigned a roll number, starting from 0. I want my school to be advanced, so I want to save each student's daily activity information: when he came to school, what lesson he learnt in first period, when he left the school, etc. Since it is a relatively new school and I don't have much money, I bought 3 machines and will store the details of the students across them. I have named these three machines Machine-0, Machine-1, and Machine-2. When I need to store a student with a given roll number, I divide the roll number by 3, and whatever remainder I get is the machine on which I store that student's details, which looks something like this:
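The modulo scheme above can be sketched in a few lines of Python (the function name `machine_for` is mine, just for illustration):

```python
NUM_MACHINES = 3

def machine_for(roll_number, num_machines=NUM_MACHINES):
    """Store each student on the machine given by roll_number % num_machines."""
    return roll_number % num_machines

# Assign all 12 roll numbers to the 3 machines.
for roll in range(12):
    print(f"roll {roll:2d} -> Machine-{machine_for(roll)}")
```

Roll numbers 0, 3, 6, 9 land on Machine-0, roll numbers 1, 4, 7, 10 on Machine-1, and so on: a nice, even split.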
Now the students are paying regular fees, and the information related to each student is also growing day by day. I'm running out of space on each machine, so I will buy a new machine, just as you buy an external memory card when you run out of phone memory.
Since I have added a new machine to distribute the load, I will follow the same strategy as before, but instead of dividing by 3 I will divide by 4, and whatever the remainder is, I will store that roll number's information on that machine, which looks something like this:
Now see, there is a problem: I had to reshuffle the information. 9 of the 12 roll numbers ended up on a different machine, which is not good.
After some days, because I was careless, some clever guy installed a virus on Machine-2, which caused problems, and that machine died. Now I'm down to three machines and will reshuffle using %3 again.
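We can measure the cost of these modulo reshuffles with a short Python sketch (the helper names are mine):

```python
def machine_for(roll, num_machines):
    """Modulo placement: the machine index is the remainder."""
    return roll % num_machines

def moved_keys(rolls, old_n, new_n):
    """Roll numbers whose machine changes when the machine count changes."""
    return [r for r in rolls if machine_for(r, old_n) != machine_for(r, new_n)]

rolls = range(12)
print(len(moved_keys(rolls, 3, 4)))  # 9: going from 3 to 4 machines moves 9 of 12 students
print(len(moved_keys(rolls, 4, 3)))  # 9: and shrinking back moves 9 again
```

Only roll numbers 0, 1, and 2 stay put; every other student has to be migrated each time the machine count changes. That is the horizontal scalability problem in miniature.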
Data Distribution — Avoiding “Data Hot Spots” in the Cluster:
We cannot expect a uniform distribution of incoming data at all times. There may be many more keys whose hash value maps to server number 1 than to any other server, in which case server number 1 becomes a hotspot for keys.
I was sad and crying.
One day Superman comes and says,
Enough drama …
What exactly is Consistent Hashing?
We need a distribution scheme that does not depend directly on the number of machines, so that when adding or removing machines the number of keys that need to be relocated is minimized. Consistent hashing facilitates the distribution of data across a set of nodes in a way that minimizes the re-mapping/reorganization of data when machines are added or removed. It is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table, by assigning each of them a position on a hash ring. This allows servers and objects to scale without affecting the overall system. Here's how it works:
1. Creating the Hash Key Space: Consider that we have 12 students; the roll numbers will be in the range [0, 12). We can represent this as a sequence of numbers with 12 slots.
2. Representing the Hash Space as a Ring: Imagine that these roll numbers are placed on a ring, something that looks like an analog clock.
3. Placing Servers on the Hash Ring: I got the damaged servers repaired, so I now have four servers, and I will place them something like this:
4. Determining Placement of Keys on Servers: To find which server a roll number goes on, all we do is travel clockwise on the ring from the point where the roll number is mapped until we find the first server. Once we find that server, we save the student's data on that machine, which ends up something like this:
5. Adding a Server to the Ring: Let's suppose I got a new machine and added it onto the ring, something like below:
After reshuffling, the data looks something like this:
See, we just moved the data for roll numbers 10 and 11 from one machine to another. Minimal migration!
6. Removing a Server from the Ring: Suppose that when I had four machines, one machine got attacked by a virus and was spoiled.
I removed the spoiled machine from the ring, and after reshuffling it looks something like this:
See, we only had to migrate the data of the students with roll numbers 10, 11, and 0 from one machine to another.
Hence we can say that consistent hashing successfully solves the horizontal scalability problem by ensuring that every time we scale up or down, we do not have to redistribute all the keys.
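The ring walk in the steps above can be sketched in Python. The class name `HashRing`, the MD5-based placement, and the machine labels are all my own assumptions, not a specific library's API:

```python
import bisect
import hashlib

def point(key):
    """Map any key to a position on the ring (MD5 is just an assumed choice)."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, machines):
        # Each machine sits at the ring position given by hashing its name.
        self.points = sorted((point(m), m) for m in machines)

    def add(self, machine):
        bisect.insort(self.points, (point(machine), machine))

    def remove(self, machine):
        self.points.remove((point(machine), machine))

    def machine_for(self, roll):
        """Travel clockwise from the key's position to the first machine."""
        hashes = [h for h, _ in self.points]
        idx = bisect.bisect(hashes, point(roll)) % len(self.points)
        return self.points[idx][1]

ring = HashRing([f"Machine-{i}" for i in range(4)])
before = {r: ring.machine_for(r) for r in range(12)}
ring.add("Machine-4")
after = {r: ring.machine_for(r) for r in range(12)}
moved = [r for r in range(12) if before[r] != after[r]]
# Only the keys that now land on Machine-4 had to move; everything else stays put.
print("moved:", moved)
```

The key property: when a machine is added, the only keys that move are the ones that fall between the new machine and its predecessor on the ring; removing the machine moves exactly those keys back.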
Now let us talk about the second problem: non-uniform distribution of data across servers.
To ensure object keys are evenly distributed among servers, we apply a simple trick: assign not one but many labels to each server on the hash ring.
As the number of replicas, or virtual nodes, on the hash ring increases, the key distribution becomes more and more uniform.
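Here is a small sketch of the virtual-node trick; the `"Machine-0#7"`-style labels and the MD5 hash are my assumed choices, and the exact counts printed will depend on the hash:

```python
import bisect
import hashlib
from collections import Counter

def point(key):
    """Map any key to a position on the ring (MD5 is an assumed choice)."""
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

def build_ring(machines, vnodes):
    """Place `vnodes` labels per machine on the ring, e.g. 'Machine-0#7'."""
    return sorted((point(f"{m}#{i}"), m) for m in machines for i in range(vnodes))

def machine_for(ring, key):
    """Clockwise walk to the first label; return the machine that owns it."""
    hashes = [h for h, _ in ring]
    idx = bisect.bisect(hashes, point(key)) % len(ring)
    return ring[idx][1]

machines = [f"Machine-{i}" for i in range(3)]
for vnodes in (1, 100):
    ring = build_ring(machines, vnodes)
    load = Counter(machine_for(ring, k) for k in range(10_000))
    print(f"{vnodes:3d} vnodes per machine -> {dict(load)}")
```

With a single label per machine, three random points split the ring into very uneven arcs, so one machine tends to hog the keys; with 100 labels per machine, the arcs average out and each machine ends up with roughly a third of the load.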
And that is, at a very very very very high level, what we do at DynamoDB. Interested in joining?