Hash Table

DinS          Written on 2017/4/26

1. Visually, what is a hash table (散列)?

See featured image above.

A hash table is a very interesting data structure. It is fundamentally different from other common data structures in that it does not require the data to be comparable. This opens the door to new possibilities, such as constant expected time for all operations.

2. Why are search, insert and delete all O(1)?

A hash table is basically a vector, but it differs from an ordinary vector in how it is addressed: a hash table follows call-by-value rather than call-by-rank. More concretely, it uses a hash function to transform the value of a datum into a specific number in O(1) time, and that number is used as the datum's rank in the vector. In other words, where a datum is stored has nothing to do with the other data, only with its own value. Whether we access, insert or delete, we can reach the datum and perform the operation regardless of the overall size of the hash table. That means O(1) time for all operations!

More technically, all possible data live in an imaginary space called the address space, of size R. The vector, which is the real container of the hash table, is of size M. Each element of the vector is called a bucket. Given any datum in the address space, we apply the hash function to its value to generate a number K. As long as K is smaller than M, we can put the datum in vector[K]. If we call the value of the datum its key, then we access the datum via vector[hash(key)], and that is O(1).

If we want to insert a new value key2, we simply put the datum in vector[hash(key2)]. No other data need to move. That is O(1). Deletion works the same way: O(1).
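To make this concrete, here is a minimal sketch in C++ (the table size M = 97 and the keys are illustrative choices, and collisions are ignored for now):

    #include <iostream>
    #include <vector>

    // A minimal hash table sketch: a vector of M buckets. The bucket
    // index depends only on the key itself, never on the other data.
    const int M = 97;                                 // illustrative table size

    int hashFunc(int key) { return key % M; }         // division method, see below

    int main() {
        std::vector<int> table(M, -1);                // -1 marks an empty bucket

        table[hashFunc(1234)] = 1234;                 // insert: O(1)
        bool found = (table[hashFunc(1234)] == 1234); // search: O(1)
        table[hashFunc(1234)] = -1;                   // delete: O(1)

        std::cout << std::boolalpha << found << '\n'; // prints: true
        return 0;
    }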

All of this works well provided there are no collisions. If key1 ≠ key2 but hash(key1) = hash(key2), we call that a collision. When a collision happens we cannot simply ignore it. Hashing is largely about designing a hash function that minimizes the chance of collision.

3. Hash function

(1) Four criteria of a good hash function

Determinism: the same key always generates the same number.

Efficiency: expected O(1) time.

Surjection: cover the whole address space as fully as possible.

Uniformity: the numbers generated should be distributed as uniformly as possible to avoid clustering.

(2) Division method

This is the most common hash function.

Hash(key) = key mod M

Taking the result mod M is always necessary no matter which hash function you use, because otherwise the number generated may be larger than the size of the vector, which is an invalid position.

Choosing M to be a large enough prime number is always a good idea too; it greatly reduces the chance of collision.
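As a quick illustration (a sketch with made-up keys), keys that share a common factor with M crowd into a few buckets, which is why a prime M helps:

    #include <iostream>
    #include <set>

    // Hash 100 keys of the form 10 * k with a composite M and a prime M,
    // and count how many distinct buckets each table size actually uses.
    int main() {
        std::set<int> compositeBuckets, primeBuckets;
        for (int k = 0; k < 100; ++k) {
            compositeBuckets.insert((10 * k) % 100);  // M = 100 (composite)
            primeBuckets.insert((10 * k) % 97);       // M = 97 (prime)
        }
        std::cout << "M = 100 uses " << compositeBuckets.size() << " buckets\n"; // 10
        std::cout << "M = 97 uses "  << primeBuckets.size()     << " buckets\n"; // 97
        return 0;
    }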

(3) MAD method

Dividing by M alone is not good enough. We can do better and define the hash function as

Hash(key) = (a * key + b) mod M

MAD stands for multiply-add-divide. If a and b are well chosen, the chance of collision can be even lower (at minimum, a should not be a multiple of M). As to what a and b should be, you need to run some experiments.
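A minimal sketch of the MAD method; a = 31, b = 17 and M = 97 are arbitrary illustrative choices, and keys are assumed non-negative:

    #include <iostream>

    const int M = 97;                 // illustrative table size
    const long long a = 31, b = 17;   // illustrative; a must not be a multiple of M

    // MAD: multiply, add, then divide (take the remainder mod M).
    int madHash(long long key) {
        return static_cast<int>((a * key + b) % M);
    }

    int main() {
        std::cout << madHash(1234) << '\n';   // prints: 53
        return 0;
    }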

(4) mid-square method

We can take several digits from the middle of key^2 as the hash number.

For example, 123^2 = 15129. If we keep the three digits in the middle, we get

Hash(123) = 512.

As for why we choose digits in the middle: the middle digits of the square are influenced by all digits of the key, so every digit plays a role. This reduces the chance of collision.
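A sketch of the mid-square method, assuming the key is large enough that key^2 has at least three digits:

    #include <iostream>
    #include <string>

    // Mid-square: square the key and keep its middle digits.
    // Here we keep 3 middle digits; the width is an illustrative choice.
    int midSquareHash(long long key) {
        std::string sq = std::to_string(key * key);   // e.g. 123^2 -> "15129"
        std::size_t start = (sq.size() - 3) / 2;      // start of the middle 3 digits
        return std::stoi(sq.substr(start, 3));        // "512" -> 512
    }

    int main() {
        std::cout << midSquareHash(123) << '\n';      // prints: 512
        return 0;
    }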

(5) folding method

We can split the key into several parts and add those parts together.

For example, cut 123456789 into three parts, each with three digits. Then we get

Hash(123456789) = 123 + 456 + 789 = 1368
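A sketch of the folding method, cutting the key into three-digit parts from the right:

    #include <iostream>

    // Folding: cut the key into 3-digit parts and add them up.
    int foldHash(long long key) {
        int sum = 0;
        while (key > 0) {
            sum += key % 1000;   // take the lowest 3 digits as one part
            key /= 1000;         // drop them and move on to the next part
        }
        return sum;
    }

    int main() {
        std::cout << foldHash(123456789) << '\n';   // 123 + 456 + 789 = 1368
        return 0;
    }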

(6) Random number method

Simply leave this problem to the designers of the random number generator and take advantage of their work.

Hash(key) = rand(key)

Warning: different platforms have different random number generators, so the same key may hash to different numbers on different platforms. Use this method at your own risk.
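One way to make the idea concrete, and to sidestep the portability problem, is to seed a generator with the key. This sketch uses std::mt19937, whose output sequence is fully specified by the C++ standard, unlike rand():

    #include <iostream>
    #include <random>

    const int M = 97;   // illustrative table size

    // "Random number" hashing: the key seeds the generator, so the same
    // key always yields the same number (determinism is preserved).
    int randomHash(unsigned int key) {
        std::mt19937 gen(key);
        return gen() % M;
    }

    int main() {
        std::cout << randomHash(1234) << '\n';
        std::cout << randomHash(1234) << '\n';   // identical to the line above
        return 0;
    }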

(7) Polynomial method

Not all keys are numbers. On many occasions we must first transform the key into a number and then apply the hash function. In other words, the procedure is hash(hashcode(key)).

If the key is a string, we can apply polynomial method.

Hash(s = x0 x1 … x(n-1)) = x0 * a^(n-1) + x1 * a^(n-2) + … + x(n-2) * a + x(n-1)

Experiments show that a = 33, 37, 39, 41 are all good choices.
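A sketch of the polynomial method for strings, evaluated by Horner's rule with the final mod M folded into the loop to avoid overflow (a = 33 and M = 97 are illustrative choices):

    #include <iostream>
    #include <string>

    const int M = 97;   // illustrative table size
    const int a = 33;   // one of the empirically good bases

    // Treat the string s = x0 x1 ... x(n-1) as a number in base a,
    // reduced mod M. Horner's rule computes the polynomial above.
    int polyHash(const std::string& s) {
        long long h = 0;
        for (char c : s)
            h = (h * a + c) % M;   // reduce each step to avoid overflow
        return static_cast<int>(h);
    }

    int main() {
        std::cout << polyHash("hash")  << '\n';
        std::cout << polyHash("table") << '\n';
        return 0;
    }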

4. Resolving collisions

No matter how good our hash function is, collisions cannot be wiped out entirely. We should always make arrangements in advance for the moment they happen.

(1) Multiple slots

We put several slots in each bucket. If a collision happens, there are spare slots to store the data.

But it is hard to predict a good number of slots in advance, and unused slots waste space. All in all this is not a good choice.

(2) Separate chaining

Each bucket holds a list. Since a list can grow without bound, this resolves collisions once and for all. The cost is that operation time may grow along with the lists; in the worst case we cannot expect O(1) time.
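A minimal separate-chaining sketch (M = 97 is illustrative; the keys 5 and 102 are chosen because they collide):

    #include <iostream>
    #include <list>
    #include <vector>

    // Separate chaining: each bucket holds a list, so colliding keys
    // simply share a bucket. Operations stay O(1) on average as long
    // as the lists stay short.
    const int M = 97;
    std::vector<std::list<int>> table(M);

    void insert(int key) { table[key % M].push_back(key); }

    bool search(int key) {
        for (int k : table[key % M])
            if (k == key) return true;
        return false;
    }

    void remove(int key) { table[key % M].remove(key); }

    int main() {
        insert(5); insert(102);                             // both hash to bucket 5
        std::cout << std::boolalpha << search(102) << '\n'; // prints: true
        remove(5);
        std::cout << search(5) << '\n';                     // prints: false
        return 0;
    }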

(3) Open addressing

Keep a collection of backup buckets in advance; these buckets form a probing chain. If a collision happens, probe along the chain in search of a free bucket. For this method to work efficiently, the load factor should be kept below 50%. In general this is a good method for resolving collisions.

Please note that with a probing chain, deletion works a bit differently. If we simply empty the bucket, the probing chain breaks and the buckets behind it become unreachable. Instead we use lazy removal: put a tag on the bucket rather than actually emptying it. When searching, if we meet a bucket with a lazy-removal tag, we continue probing. When inserting, if we meet a bucket with a lazy-removal tag, we treat it as empty and put the data there.

There are many different probing strategies; I'll briefly discuss them here.

i. Linear probing

Rule: once a collision occurs, probe the next bucket.

(Hash(key) + i) mod M   for i = 0, 1, 2, 3…

This strategy is very intuitive. It needs no extra room and every bucket can be used. The downside is that previous collisions pile up and worsen later ones. For example:

We have an empty hash table. Suppose hash(key) = key % 5.

If we insert {0, 1, 2, 5}, only 5 causes a collision and triggers linear probing.

But if we insert {5, 0, 1, 2}, then 0, 1 and 2 all cause collisions and trigger linear probing.
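Here is a sketch of open addressing with linear probing, including the lazy-removal tags described above (M = 97 and the keys 5, 102, 199, which all hash to bucket 5, are illustrative):

    #include <iostream>
    #include <vector>

    // Linear probing with lazy removal. Each bucket is Empty, Occupied,
    // or Deleted (the lazy-removal tag).
    const int M = 97;
    enum State { Empty, Occupied, Deleted };

    std::vector<int>   keys(M);
    std::vector<State> state(M, Empty);

    void insert(int key) {
        for (int i = 0; i < M; ++i) {                  // probe along the chain
            int pos = (key % M + i) % M;
            if (state[pos] != Occupied) {              // Empty or Deleted: reuse it
                keys[pos] = key; state[pos] = Occupied; return;
            }
        }
    }

    int find(int key) {                                // returns position or -1
        for (int i = 0; i < M; ++i) {
            int pos = (key % M + i) % M;
            if (state[pos] == Empty) return -1;        // a true hole ends the chain
            if (state[pos] == Occupied && keys[pos] == key) return pos;
            // Deleted: keep probing, the chain continues past the tag
        }
        return -1;
    }

    void remove(int key) {
        int pos = find(key);
        if (pos != -1) state[pos] = Deleted;           // tag it, don't empty it
    }

    int main() {
        insert(5); insert(102); insert(199);           // all land in the same chain
        remove(102);                                   // lazy removal keeps the chain intact
        std::cout << (find(199) != -1) << '\n';        // prints: 1 (still reachable)
        return 0;
    }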

ii. Quadratic probing

Rule: once a collision occurs, probe the bucket at a squared distance.

(Hash(key) + i^2) mod M   for i = 0, 1, 2, 3…

This strategy aims to solve linear probing's problems of accumulated collisions and clustering. By jumping at quadratically growing distances, data escapes the collision zones. The downside is that some buckets will never be used; this can be proved mathematically. On the other hand, it can also be proved that if M is prime and the load factor is below 50%, quadratic probing is guaranteed to find a free bucket, so there is no worry on that front.
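Both claims can be checked empirically. This sketch counts how many distinct buckets the offsets i^2 mod M can ever reach for the prime M = 97; the answer is (M + 1) / 2, so about half the buckets are never probed:

    #include <iostream>
    #include <set>

    // Count the distinct probe offsets i^2 mod M for a prime M.
    int main() {
        const int M = 97;
        std::set<int> reachable;
        for (int i = 0; i < M; ++i)
            reachable.insert((i * i) % M);   // the offsets repeat after i = M - 1
        std::cout << reachable.size() << " of " << M << " buckets\n";   // 49 of 97
        return 0;
    }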

iii. Double-way quadratic probing

Rule: once a collision occurs, probe buckets at squared distances, alternating between positive and negative offsets.

(Hash(key) ± i^2) mod M   for i = 0, 1, 2, 3…

This strategy is an improved version of quadratic probing. If M is a prime number of the form 4k + 3, it can be proved mathematically that every bucket can be reached by this strategy.
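The same experiment with alternating signs confirms this for a 4k + 3 prime (M = 7 = 4 * 1 + 3 here):

    #include <iostream>
    #include <set>

    // Count the distinct offsets +i^2 and -i^2 (mod M) for M = 7.
    int main() {
        const int M = 7;                              // 7 = 4 * 1 + 3
        std::set<int> reachable;
        for (int i = 0; i < M; ++i) {
            reachable.insert((i * i) % M);            // +i^2 offsets
            reachable.insert((M - (i * i) % M) % M);  // -i^2 offsets, kept non-negative
        }
        std::cout << reachable.size() << " of " << M << " buckets\n";   // 7 of 7
        return 0;
    }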