Part 1: Basics of Hashing

The sooner you start to code, the longer the program will take

Part 1: Basics of Hashing

Hashing is the process of converting a given key into another value. A hash function is used to generate the new value according to the mathematical function. The result of the hash function is known as the hash value. Let's understand this with the help of an example -

You have been provided a set S of numbers. You have to perform 3 kinds of
queries :

  • Check if element X is present in the set.
  • Add a new element X to this set, if not present
  • Delete an element X from this set, if present

Constraints :

  • Each element of the set is distinct
  • 0 <= X <= 10 power 5

Come on, think about the way how you will solve this problem?

Solution 1: Naive Approach

This approach is generally the first hit of our mind. Here, we will loop over the set (you can assume the set as an array) and can easily check the element, add the element and delete an element.

The complexity of the queries -

    O(N) [Linear Search]
    O(N) [Loop over all the elements in the worst case, add after the last element]
    O(N) [Create a new array excluding that element]

Solution 2: Direct addressing Scheme

Create a boolean array of the size of the constraint. Initialize all the elements as False. Take the set of given numbers and change to True the corresponding elements in the array.

Let's say the given set is S = [0, 3, 8, 10, 12]

A = [T, F, F, T, F, F, F, F, T, F, T, F, T, ................. , F] <-- dummy boolean array

The complexity of the queries -

    O(1) [check the value of Xth index of the array, if True then present else not present]
    O(1) [change the value of Xth index to True]
    O(1) [change the value of Xth index to False]

The catch here is that the initialization of this boolean array is O(N) where N is the set's length as we have to loop over the set and populate the boolean dummy array also called the Addressing table. After initialization, all required queries will take O(1) right?

Downsides of this approach -

  • Memory Limitation: In some cases, constraints can be too large to store
  • Wastage of memory: If our set S has 5 elements, we still have to initialize an array of size of the constraint
  • Limited to only positive integers: Can be used for negative integers with little modification

Solution 3: Hashing

The difference between direct addressing and hashing is instead of directly putting an element in the addressing table we will now pass the element first from the hash function (f(x)) then will store the output of that function in the addressing table.

ex. f(x) = x % 11 where x is the element of set S

Now whatever the input is the output of this hash function will be in [0,10]. So, our boolean dummy array will be of length 11

A = [F, F, F, F, F, F, F, F, F, F, F, F]
Let's take S = {1, 2, 14, 20, 99} and hash function f(x) = x % 11

Calculate the value of hash function for every element of S - 
-  f(1) = 1 % 11 = 1
-  f(2) = 2 % 11 = 2
-  f(14) = 14 % 11 = 3
-  f(20) = 20 % 11 = 9
-  f(99) = 99 % 11 = 0

Now, we know in A only index 1, 2, 3, 9, 0 will be True

So, A = [T, T, T, T, F, F, F, F, F, T, F]

All three downsides are direct addressing approach is resolved here. The only catch here is to get the hash function right

Note: The hash function f(x) = x % 11 is just an example taken for you guys to understand. It's not a good hash function as you can see the collision rate is too high. ex: for x = 3 and x = 25 this hash function will return the same output

Hashing - A hash function should follow the mathematical definition of functions (either one to one or many to one) and should be consistent with respect to their outputs. A good hash function distributes the keys as uniformly as possible

Collision - It happens when two or more different inputs to the hash function return the same output. Let's take a hash function, f(x) = x % 11

For x = 3 and x = 25 the value of f(x) is same i.e 3. This is known as a collision or a hit. Less number of hits corresponds to a good hash function.

Collision Resolution - To resolve or reduced collision certain methods are used. Two well known used algorithms are Chaining and Double Hashing Technique

This is just the prerequisite to get started on the real problems of hashing. In the next part, we will understand the String Hashing, performance load factor, and other sub-concepts that will give you a solid understanding of hashing and its real-world use case.

Hope you like this knowledge, please like or comment, cheers. Thanks