What is Load factor and Rehashing in Hashmap?

This is a popular interview question for experienced developers, so let's see what it is all about.

HashMap is a very popular data structure, useful for solving many problems thanks to its O(1) time complexity for both get and put operations.
Before diving into Load Factor and Rehashing, it is important to understand the articles below,
so please go through them first if you are not familiar with these topics:


What is a HashMap & how does the HashMap API work?
What is a hashcode and how does HashMap use it?
How is the time complexity of HashMap's put and get operations O(1)?


Load Factor


As the total number of items in a HashMap keeps increasing while the default initial capacity stays at 16, at some point performance starts degrading, and the number of buckets needs to grow to restore it.

Load Factor is a measure that decides exactly when to increase the HashMap capacity (the number of buckets) so that get and put operations keep their O(1) complexity.
The default load factor of HashMap is 0.75f (i.e. 75% of the current capacity).
You can also say the load factor is a measure of "up to what load the HashMap can accept new elements before its capacity is automatically increased."

The above lines will make more sense with the help of an example.
The default capacity of HashMap is 2^4 = 16 buckets.
Let's say we have a well-implemented hashCode() method that distributes key-value pairs evenly across the 16 buckets.

So, if there are 16 items in the HashMap, a good hashCode() method will place 1 item in each bucket. Searching for any item in this case requires only 1 look-up.

Now, if there are 32 items in the HashMap, a good hashCode() method will place 2 items in each bucket. Searching for any item in this case takes at most 2 look-ups.

And if there are 128 items in the HashMap, a good hashCode() method will place 8 items in each bucket. Searching for any item in this case takes at most 8 look-ups.

Notice that even when the number of items doubles, the maximum number of look-ups in each bucket does not grow dramatically; it remains almost constant.
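The arithmetic above can be sketched in a few lines of Java (a plain calculation for illustration, not HashMap's internals; the `lookupsPerBucket` helper is hypothetical):

```java
public class BucketLookups {
    // Worst-case comparisons for one lookup when a good hashCode()
    // spreads the entries evenly across all buckets.
    static int lookupsPerBucket(int items, int buckets) {
        return items / buckets;
    }

    public static void main(String[] args) {
        int buckets = 16; // default HashMap capacity
        for (int items : new int[] {16, 32, 128}) {
            System.out.println(items + " items -> "
                    + lookupsPerBucket(items, buckets) + " look-up(s) per bucket");
        }
    }
}
```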

But what happens if the number of items in the map keeps growing?
If the item count keeps increasing while the number of buckets stays fixed at 16, then at some point the HashMap's performance will start degrading because of the large number of items in each bucket.

Now, say there are 500,000 items in the HashMap; a good hashCode() method will distribute 31,250 items into each bucket. Searching for any item in this case requires up to 31,250 look-ups.

Compared to the total number of items in the HashMap, the number of look-ups needed within a bucket is small, but it is still expensive, as each bucket now holds 31,250 items.
So in the worst case, both put and get have to compare against 31,250 items.


Just double the total items from 500,000 to 1,000,000: each bucket will now hold 62,500 items, and searching for an item will seriously hurt performance.


So what is the solution?

Initially we were doing fine, when the bucket count (16) was large relative to the small number of items.
As the total number of items grew, performance started degrading because too many items ended up in each bucket. So what do you think the problem is?

The problem is that we kept the bucket count fixed at 16 while the total number of items kept increasing, and that disturbed the time complexity.

If we increase the total number of buckets when the number of items in each bucket starts to grow, then
we can keep a roughly constant number of items in each bucket and maintain the O(1) time complexity of get and put operations.

The decision of when to increase the number of buckets is made by the Load Factor.

Load Factor is a measure that decides exactly when to increase the HashMap capacity (the number of buckets) so that get and put operations can still have O(1) complexity.


By default, the initial capacity of the HashMap is 16 and the load factor is 0.75.
So the point at which the HashMap grows is decided by the product
(initial capacity of the HashMap * load factor of the HashMap).
Let's see when the HashMap's size will first be increased based on the above formula: initial capacity * load factor = 16 * 0.75 = 12.

This means that when the 12th item (key-value pair) arrives in the HashMap, it increases its capacity from the default 2^4 = 16 buckets to 2^5 = 32 buckets.
The HashMap grows in powers of 2 (2^n): initially n = 4, and n keeps growing each time the threshold limit is reached.
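The threshold calculation can be written out directly (a sketch mirroring the capacity * load factor product described above; the `threshold` method name is just for illustration):

```java
public class ResizeThreshold {
    // Number of entries at which the map grows: capacity * loadFactor.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f)); // default capacity: resizes at 12 entries
        System.out.println(threshold(32, 0.75f)); // after one doubling: resizes at 24 entries
    }
}
```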
Another way of calculating when the HashMap's size will be doubled:
the HashMap increases its capacity when the load factor ratio (m/n) reaches 0.75, where n = the total number of buckets in the hash map and m = the number of entries in the map.
Let's take an example.

The default bucket count is 16. When the first element comes in, whether we need to increase the HashMap capacity is decided by:
number of entries / number of buckets = 1/16 = 0.0625.
Is 0.0625 >= 0.75 (the load factor)? No. So there is no need to increase the map size.

When the 11th element comes in: 11/16 = 0.6875.
Is 0.6875 >= 0.75? No. So there is no need to increase the map size.

When the 12th element comes in: 12/16 = 0.75.
Is 0.75 >= 0.75? Yes. We need to increase the map size now.
A load factor of around 0.75 is advisable for keeping put and get complexity close to O(1).
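The check walked through above can be simulated with a small sketch (the `shouldResize` helper is hypothetical, not HashMap's actual code):

```java
public class LoadFactorCheck {
    static final float LOAD_FACTOR = 0.75f;

    // True once the entries-to-buckets ratio reaches the load factor.
    static boolean shouldResize(int entries, int buckets) {
        return (float) entries / buckets >= LOAD_FACTOR;
    }

    public static void main(String[] args) {
        int buckets = 16;
        for (int entries : new int[] {1, 11, 12}) {
            System.out.println(entries + "/" + buckets + " = "
                    + ((float) entries / buckets)
                    + " -> resize? " + shouldResize(entries, buckets));
        }
    }
}
```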
NOTE:
The load factor and the initial capacity (number of buckets) can be configured when creating the HashMap, as shown below:
HashMap<String, Integer> map = new HashMap<>(32, 0.80f); // initial capacity, load factor

Rehashing


Rehashing is the process of re-calculating the hash codes of already stored entries (key-value pairs) in order to move them into a bigger hash table once the load factor threshold is reached.

When the number of items in the map crosses the load factor limit, the HashMap doubles its capacity, and the hash codes of the already stored elements are re-calculated for even distribution of key-value pairs across the new buckets.

Why Rehashing is required?


After doubling the capacity, what do we do with the key-value pairs already present in the buckets?
If we keep the existing key-value pairs where they are, doubling the capacity may not help,
because O(1) complexity is achieved only when items are evenly distributed across all buckets.

So for each existing key-value pair, the hash code is calculated again using the increased HashMap capacity as a parameter, which places the item either in the same bucket or in a different one.

This process of re-calculating the hash codes of already placed key-value pairs whenever the size of the HashMap changes is known as rehashing.
Rehashing is done to distribute items across the new, larger hash table, so that get and put operation time complexity remains O(1).
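Here is a minimal sketch of rehashing over a toy table of string keys (simplified on purpose: a real HashMap stores key-value nodes and masks the hash with capacity - 1 rather than using `%`):

```java
import java.util.ArrayList;
import java.util.List;

public class RehashDemo {
    // Toy bucket index; real HashMap mixes the hash and masks with (capacity - 1).
    static int indexFor(String key, int capacity) {
        return Math.abs(key.hashCode() % capacity);
    }

    // Rebuild the table at double the capacity, recomputing each key's
    // bucket index against the new length -- the essence of rehashing.
    static List<List<String>> rehash(List<List<String>> table) {
        int newCapacity = table.size() * 2;
        List<List<String>> bigger = new ArrayList<>();
        for (int i = 0; i < newCapacity; i++) bigger.add(new ArrayList<>());
        for (List<String> bucket : table)
            for (String key : bucket)
                bigger.get(indexFor(key, newCapacity)).add(key);
        return bigger;
    }

    public static void main(String[] args) {
        int capacity = 4;
        List<List<String>> table = new ArrayList<>();
        for (int i = 0; i < capacity; i++) table.add(new ArrayList<>());
        for (String key : new String[] {"apple", "banana", "cherry"})
            table.get(indexFor(key, capacity)).add(key);

        List<List<String>> resized = rehash(table);
        System.out.println("old capacity: " + table.size()
                + ", new capacity: " + resized.size());
    }
}
```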

NOTE:

HashMap maintains O(1) complexity while inserting and retrieving data, but the put request for the 12th key-value pair will not be O(1): as soon as the map sees the 12th element arrive, meaning 75% of the map is filled, it first doubles the bucket (array) capacity and then performs the rehash.

Rehashing requires re-computing the hash codes of the 11 already placed key-value pairs and putting them at their new indexes, which takes time.
But over the long run the cost of rehashing is amortized, and the overall time complexity HashMap provides for get and put operations remains O(1).
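A rough way to see the amortization: count how many entries all the rehashes combined ever move, under the 0.75 threshold and capacity-doubling scheme described above (a simulation for illustration, not HashMap itself):

```java
public class AmortizedResize {
    // Total entries copied by every rehash while inserting n items,
    // starting from capacity 16 with load factor 0.75.
    static long entriesMoved(int n) {
        int capacity = 16;
        long moved = 0;
        for (int size = 1; size <= n; size++) {
            if (size >= (int) (capacity * 0.75f)) { // threshold reached
                moved += size - 1;  // rehash everything stored so far
                capacity *= 2;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long moved = entriesMoved(n);
        // Total rehash work stays within a small constant factor of n,
        // so each insert pays only O(1) on average.
        System.out.println("inserts: " + n + ", entries moved: " + moved
                + ", moved per insert: " + (double) moved / n);
    }
}
```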
