Hash Table in C++

Hash table uses hashing to implement a data structure that maps a key to an index so the value can be stored in the corresponding location (a bucket) in a sequential data structure like an array.

Quick question: why linked-list is not a good way to store buckets in a hash table? (Because the most frequent usage is to access by index, which is slow with linked-list data structure)

  • Requirements

    • Keys are unique (no duplicate keys)

    • Keys are immutable

  • Properties

    • Fast lookup, insert, and delete \(\Theta(1)\)

    • Unordered

    • Redundant storage to reduce collision and ensure performance

Hash function for hash tables

  • \(h(key)\) a special type of hash function used by hash table

  • Maps a key of a certain type to an integer in a certain range \(1, 2, \ldots, size - 1\), where \(size\) is the size of the hash table

  • Modulo arithmetic

  • Considerations

    • Deterministic

    • Fast to compute

    • Uniform distribution

Hash Table Applications

  • Hash map

    • Implementation of the Map ADT using hash table

    • A.k.a dictionary, associative array, symbol table, etc.

    • A collection of key-value pairs

    • Key is unique and immutable

    • Behaviors

      • Set a key-value pair

      • Get the value of a key (key lookup)

      • Delete a key-value pair

  • Hash set

    • Implementation of the Set ADT using hash table

    • A collection of unique elements

    • A special case of key-value pairs when no value is present

    • Only store keys (or say keys themselves are considered the values)

    • Keys are unique and immutable

    • Behaviors

      • Add a key

      • Remove a key

      • Check if a key exists

      • Set operations: Union, intersection, difference, subset, etc.

Collision resolution

Chaining

  • A.k.a separate chaining

  • Each bucket can store multiple key-value pairs

    • Linked-list

    • Dynamic array (vector)

  • unlimited storage space

  • Performance degradation when over-filled

  • C++ implementation

    • vector<vector<pair<keyType, valueType>>> hashTable

      Vector of vectors of pair objects.

    • vector<LinkedListNode<keyType, valueType> *> hashTable

      Vector of pointers to linked list.

    • LinkedListNode<keyType, valueType> ** hashTable

      One dimensional dynamic array of pointers to a linked list.

      1template <typename keyType, typename valueType>
      2struct LikedListNode {
      3  keyType key;
      4  valueType value;
      5  LinkedListNode * next;
      6  LinkedListNode * prev;
      7};
      

Open addressing

  • find another available bucket to store the value when a collision occurs

  • probing sequence: the sequence of the buckets to check when collision happens

  • Linear probing:

    \((h1(key) + C_{1} * i) \% size, i = 0, 1, 2, 3 \dots size - 1\)

  • quadratic probing:

    \((h1(key) + C_{1} * i^2) \% size, i = 0, 1, 2, 3 \dots size - 1\)

  • double hashing probing:

    \((h1(key) + C_{1} * i * h2(key)) \% size, i = 0, 1, 2, 3 \dots size - 1\)

  • \(h1(key)\) is the primary hash function

  • \(h2(key)\) is the secondary hash function

  • \(C_{1}\) and \(C_{2}\), etc. are constants

  • lookup/insertion/removal implementation

    • each bucket keeps both key and value

    • differentiate empty-since-start vs empty-after-removal

  • C++ implementation

    • Bucket<keyType, valueType> * hashTable

      One dimensional dynamic array of bucket objects.

    • vector<Bucket<keyType, valueType>> hashTable

      Vector of bucket objects.

      1template <typename keyType, typename valueType>
      2struct Bucket {
      3  keyType key;
      4  valueType value;
      5  bool isEmpty = true;
      6  bool isDeleted = false;
      7};
      

Chaining Hash Map Example

 1class HashMap1 {
 2  struct Node {
 3    int key;
 4    int value;
 5    Node *next;
 6  };
 7  int capacity;
 8  int size;
 9  Node **buckets;  // 1D array of pointers to linked lists
10  int hashFunction(int key) const;
11public:
12  HashMap1(int capacity = 101);  // prime number
13  HashMap1(const HashMap1 &other);
14  HashMap1 &operator=(const HashMap1 &other);
15  ~HashMap1();
16  void resize(int newCapacity); // rehashing
17  void put(int key, int value);
18  bool remove(int key);
19  int get(int key) const;
20  int size() const;
21};

Open addressing example

 1class Bucket {
 2  int key;
 3  int value;
 4  bool emptySinceStart;
 5  bool emptySinceRemoval;
 6 public:
 7  Bucket();
 8  void set(int key, int value);
 9  void remove();
10  bool isEmptySinceStart() const;
11  bool isEmptySinceRemoval() const;
12  bool isEmpty() const;
13  int getKey() const;
14  int getValue() const;
15};
16
17// HashMap
18// Linear probing
19class HashMap {
20  int capacity;
21  Bucket *buckets;  // 1D array of buckets
22  int hashFunction(int key);
23  // probing algorithm
24  int probe(int hash, int probed);
25 public:
26  HashMap(int capacity = 100);
27  HashMap(const HashMap &other);
28  HashMap &operator=(const HashMap &other);
29  ~HashMap();
30  bool set(int key, int value);
31  bool remove(int key);
32  int lookUp(int key);
33};

Hash Table Glossary

Key (hash table)

It is the unique identifier to locate the data stored in a hash table. The input to the hash function to calculate the index of the buckets in the hash table. The key is unique and immutable.

Value (hash table)

The data associated with the key in the hash table.

Hash function (hash table)

A special type of hash function used by hash table. It maps a key of an arbitrary type to an index to be used to refer to a bucket in the hash table.

Bucket (hash table)

The storage location in the hash table where the value is stored. It is usually an array element.

Collision (hash table)

A collision occurs when multiple keys result in the same index in the array. It is a common issue to be solved routinely in hash tables.

Collision resolution (hash table)

The approach to resolve collisions in hash tables. It is a critical part of the hash table implementation. There are two main approaches: chaining and open addressing.

Load factor (hash table)

The ratio of the number of keys to the number of buckets in the hash table. It is used to measure the health of a hash table and determine when to resize the hash table.

Rehashing (hash table)

The process of resizing the hash table to a new size when the load factor exceeds a certain threshold. It is used to maintain the performance of the hash table.

Chaining

A collision resolution approach in which every element in an array is a variable-length list of elements.

Open addressing

A collision resolution approach in which an empty bucket elsewhere in the hash table is used when collision happens

Probing

The mechanism to determine the next index of bucket to check in the open addressing collision resolution approach

Hash map

An implementation of the Map ADT based on the hash table data structure

Hash set

An implementation of the Set ADT based on the hash table data structure