Let's be done with the boring definitions first, or you can just skip to the next section if you already know what a HashMap does. HashMap is a data structure that maps keys to values, duh! It is a map-based collection class, as you might have guessed. However, HashMap does not guarantee the order in which entries are stored. A HashMap uses a hash function to compute an index that determines the bucket into which an entry is stored or from which it is retrieved. Woah, let's break things down.
So, before delving into the internal workings of HashMap, let's understand its building blocks.
1. Map Entry
2. Bucket
3. Hash function
1. Map Entry:
These are the actual data holders in a HashMap. When we store a key-value pair into a HashMap, it stores the data in the form of a Node. Node is a static class defined inside HashMap that implements the Map.Entry interface.
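Here is a trimmed-down sketch of that Node class, lightly abridged from the JDK 8 source linked in the references (several methods omitted):
static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;          // cached hash used to pick the bucket
    final K key;
    V value;
    Node<K,V> next;          // reference to the next Node in the same bucket

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }

    public final K getKey()   { return key; }
    public final V getValue() { return value; }
    // setValue, equals, hashCode and toString omitted for brevity
}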
As you can see, the Node class contains not just the key and value, but also the hash and a reference to the next Node. That means when we call the put method on a HashMap, it stores the hash and the next-node reference along with the key and value.
2. Bucket:
A bucket is a list of Nodes. Nodes whose hashes map to the same index fall into the same bucket (more about the hash value next). A HashMap is essentially an array of buckets -> Node<K, V>[].
3. Hash function:
Each node stores a hash along with its key and value. This hash value is used to determine the index of the bucket where the key-value pair will be stored as a Node.
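The hash function itself is tiny; this is essentially what the JDK 8 source (linked in the references) does:
static final int hash(Object key) {
    int h;
    // null keys hash to 0; otherwise the top 16 bits of hashCode() are XORed into the lower 16 bits
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}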
From the hash function, we can infer that if the key is null, 0 is returned. That means the value for a null key is always stored in the first bucket (i.e. the bucket with index 0).
However, if the key is not null, the key's hashCode is calculated first. The upper 16 bits of that hash code (obtained with the unsigned right shift operator, >>>) are then XORed with its lower bits. This way, the higher bits also take part in computing the final hash value. XOR is good at mixing bits, so it helps spread entries more uniformly across the HashMap's buckets.
Note: The index of the bucket is found using the formula ((n - 1) & h) (explained later).
Before checking how values are stored in and retrieved from a HashMap, let's look at a constructor of HashMap.
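The two-argument constructor looks roughly like this, condensed from the JDK 8 source (tableSizeFor and MAXIMUM_CAPACITY are HashMap internals):
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity); // rounded up to the next power of two
}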
HashMap has a no-argument constructor as well, the one we generally use. I chose the constructor above because it shows the two important ingredients used to create a HashMap. Let's tackle them one at a time.
1. initialCapacity: This one looks pretty straightforward, and yes, you guessed it right: it is the number of buckets the HashMap starts with. The default initial capacity of HashMap is 16.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
2. loadFactor: Why do we need a loadFactor? Suppose we have a HashMap that will always contain 16 buckets. If we have a large number of values to store, multiple values end up in the same bucket, the search complexity becomes linear, and we lose the main benefit of a HashMap, which is near constant-time lookup. Similarly, if we create a HashMap with a large number of buckets and only store a few values, the extra allocated memory is wasted.
So, it becomes necessary to increase the number of buckets as and when the number of values to be stored increases.
The default loadFactor value is 0.75:
static final float DEFAULT_LOAD_FACTOR = 0.75f;
loadFactor decides when the HashMap has to be resized so that the complexity of the get and put operations stays O(1). Concretely, the value above means that when the number of entries stored in the HashMap exceeds 0.75 times its current capacity, the number of buckets is doubled.
Suppose we add 50 elements; the final capacity after adding all of them will be 128.
Let's see how.
0.75 x 16 = 12 -> the 13th insertion crosses the threshold -> capacity = 16 x 2 = 32
0.75 x 32 = 24 -> the 25th insertion crosses the threshold -> capacity = 32 x 2 = 64
0.75 x 64 = 48 -> the 49th insertion crosses the threshold -> capacity = 64 x 2 = 128
Phew ε-(´・`) フ
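If you want to sanity-check that arithmetic, here is a throwaway simulation of the same doubling rule (my own demo class, not the real resize() code):
public class GrowthDemo {
    public static void main(String[] args) {
        int capacity = 16;            // default initial capacity
        float loadFactor = 0.75f;     // default load factor
        for (int size = 1; size <= 50; size++) {
            if (size > capacity * loadFactor) {
                capacity *= 2;
                System.out.println("entry " + size + " -> capacity doubles to " + capacity);
            }
        }
        System.out.println("final capacity: " + capacity); // prints 128
    }
}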
Now we can get to the part where we actually perform operations on a HashMap. Let's put some values into it.
put: put internally calls a final method putVal.
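Here they are, lightly annotated and abridged from the JDK 8 source linked in the references (TreeNode handling is kept as in the original):
public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;                      // lazily allocate the bucket array
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);         // empty bucket: new node becomes the head
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;                                        // head node already holds this key
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {               // reached the tail: append a new node
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1)
                        treeifyBin(tab, hash);            // long chain: convert to a tree
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;                                // found an existing node with this key
                p = e;
            }
        }
        if (e != null) {                                  // existing mapping for this key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}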
Key points to observe in HashMap putVal:
1. When we call new HashMap<>(), it does not allocate space for the bucket array right away. Instead, it waits until the first insertion. If the bucket array has not been created yet, memory is allocated via resize().
2. If the bucket has no nodes, the new entry becomes the first node in that bucket. The index of the bucket where the new value is inserted is calculated with the formula ((n - 1) & hash), where n is the number of buckets in the HashMap. Since n is always a power of two, this effectively takes the hash modulo n and guarantees the index never exceeds the bucket array's bounds. For example, with the default 16 buckets, (16 - 1) & hash keeps only the lowest four bits of the hash, giving an index between 0 and 15.
3. If there are already some nodes in the bucket, we need to traverse them and check whether any node has the same key as the one we passed in.
4. The match is done based on 2 conditions.
a. the hash values must match
b. the keys refer to the same object, or are equal (object equality is measured by calling the key's equals method).
So, the HashMap traverses the bucket one node at a time; once it finds the same key, it replaces the value for that node.
Extra note:
Why is it said that if we override the equals method for an object, we must also override the hashCode method?
Let's find the answer in HashMap's put implementation.
What if you override only the equals method? If you do so, the hashCode method of the key class can produce different hash codes for two keys that are equal according to equals (the default hashCode implementation is typically derived from something like the object's memory address). That could cause values for equal keys to land in different buckets, which completely defeats the purpose of a HashMap: an equal key should always lead back to the same entry.
Follow these two rules for hashCode and equals when overriding them:
i. If two objects are equal according to equals(), then calling hashCode on each of them must produce the same hash code.
ii. It is not required that if two objects are unequal according to equals(), calling hashCode on each of them must produce distinct values (though distinct values can improve performance).
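As a quick illustration, here is a hypothetical key class (Point and its fields are made up purely for this example) that keeps the two methods consistent:
import java.util.Objects;

final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Built from the same fields used in equals, so equal Points always produce the same hash
        return Objects.hash(x, y);
    }
}
With both methods overridden like this, equal Point keys always land in the same bucket and are recognised as the same key.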
5. If the first node's key does not match, traverse each node in the bucket in search of a matching key.
If the traversal reaches the end of the list, i.e. p.next is null, a new node is created and appended with the given key, value and the calculated hash.
One more thing you might have noticed is the call to the treeifyBin function. It helps prevent a linear-time search when many keys map to the same bucket: if the number of nodes in a bucket reaches the TREEIFY_THRESHOLD value, the nodes are stored as a (red-black) tree instead of a linked list, so the search takes O(log n) time.
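For reference, the JDK 8 source defines that threshold as:
static final int TREEIFY_THRESHOLD = 8;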
If a node with a matching key is found, the loop breaks.
6. After coming out of the else branch, if e is not null, a node with a matching key was already present in the bucket. So we retrieve the old value to return it, and store the new value in that node.
If onlyIfAbsent is true, HashMap does not overwrite an existing (non-null) value.
afterNodeAccess is a callback hook that is empty in HashMap itself; subclasses such as LinkedHashMap override it to run logic after an entry is accessed or updated.
Finally, modCount is used to track structural modifications of the HashMap. If modCount changes while iterating, HashMap's fail-fast iterator throws a ConcurrentModificationException. It lets you know that the HashMap has been modified, and you may need to iterate again with a new iterator instance to see the latest values.
After every insertion, resize() is called if the size grows beyond the threshold (capacity x load factor), as explained earlier.
If there was no matching key, null is returned at the end.
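A tiny, self-contained demo of that return behaviour (the class name is mine; the HashMap calls are standard java.util usage):
import java.util.HashMap;

public class PutReturnDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        System.out.println(map.put("answer", 42)); // null -> no previous mapping for "answer"
        System.out.println(map.put("answer", 43)); // 42   -> old value returned, new value stored
    }
}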
That is all for the put implementation.
get():
The get implementation is even simpler. get internally calls a final method, getNode, which retrieves the node containing a matching key. If such a node exists, its value is returned; otherwise null is returned.
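Again lightly annotated and abridged from the JDK 8 source linked in the references:
public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {          // locate the bucket
        if (first.hash == hash &&
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;                                 // head of the bucket matches
        if ((e = first.next) != null) {
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {                                          // walk the rest of the chain
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}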
The getNode method first finds the bucket in which the key could be present, then traverses each node in that bucket looking for a matching key. Once it finds a node with the matching key, that node is returned. In all other cases, null is returned.
References:
http://androidxref.com/9.0.0_r3/xref/libcore/ojluni/src/main/java/java/util/HashMap.java
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java