Hash Table in C++¶
Hash table uses hashing to implement a data structure that maps a key to an index so the value can be stored in the corresponding location (a bucket) in a sequential data structure like an array.
Quick question: why linked-list is not a good way to store buckets in a hash table? (Because the most frequent usage is to access by index, which is slow with linked-list data structure)
Requirements
Keys are unique (no duplicate keys)
Keys are immutable
Properties
Fast lookup, insert, and delete \(\Theta(1)\)
Unordered
Redundant storage to reduce collision and ensure performance
Hash function for hash tables¶
\(h(key)\) a special type of hash function used by hash table
Maps a key of a certain type to an integer in a certain range \(1, 2, \ldots, size - 1\), where \(size\) is the size of the hash table
Modulo arithmetic
Considerations
Deterministic
Fast to compute
Uniform distribution
Hash Table Applications¶
Hash map
Implementation of the Map ADT using hash table
A.k.a dictionary, associative array, symbol table, etc.
A collection of key-value pairs
Key is unique and immutable
Behaviors
Set a key-value pair
Get the value of a key (key lookup)
Delete a key-value pair
Hash set
Implementation of the Set ADT using hash table
A collection of unique elements
A special case of key-value pairs when no value is present
Only store keys (or say keys themselves are considered the values)
Keys are unique and immutable
Behaviors
Add a key
Remove a key
Check if a key exists
Set operations: Union, intersection, difference, subset, etc.
Collision resolution¶
Chaining¶
A.k.a separate chaining
Each bucket can store multiple key-value pairs
Linked-list
Dynamic array (vector)
unlimited storage space
Performance degradation when over-filled
C++ implementation
vector<vector<pair<keyType, valueType>>> hashTable
Vector of vectors of pair objects.
vector<LinkedListNode<keyType, valueType> *> hashTable
Vector of pointers to linked list.
LinkedListNode<keyType, valueType> ** hashTable
One dimensional dynamic array of pointers to a linked list.
1template <typename keyType, typename valueType> 2struct LikedListNode { 3 keyType key; 4 valueType value; 5 LinkedListNode * next; 6 LinkedListNode * prev; 7};
Open addressing¶
find another available bucket to store the value when a collision occurs
probing sequence: the sequence of the buckets to check when collision happens
Linear probing:
\((h1(key) + C_{1} * i) \% size, i = 0, 1, 2, 3 \dots size - 1\)
quadratic probing:
\((h1(key) + C_{1} * i^2) \% size, i = 0, 1, 2, 3 \dots size - 1\)
double hashing probing:
\((h1(key) + C_{1} * i * h2(key)) \% size, i = 0, 1, 2, 3 \dots size - 1\)
\(h1(key)\) is the primary hash function
\(h2(key)\) is the secondary hash function
\(C_{1}\) and \(C_{2}\), etc. are constants
lookup/insertion/removal implementation
each bucket keeps both key and value
differentiate empty-since-start vs empty-after-removal
C++ implementation
Bucket<keyType, valueType> * hashTable
One dimensional dynamic array of bucket objects.
vector<Bucket<keyType, valueType>> hashTable
Vector of bucket objects.
1template <typename keyType, typename valueType> 2struct Bucket { 3 keyType key; 4 valueType value; 5 bool isEmpty = true; 6 bool isDeleted = false; 7};
Chaining Hash Map Example¶
1class HashMap1 {
2 struct Node {
3 int key;
4 int value;
5 Node *next;
6 };
7 int capacity;
8 int size;
9 Node **buckets; // 1D array of pointers to linked lists
10 int hashFunction(int key) const;
11public:
12 HashMap1(int capacity = 101); // prime number
13 HashMap1(const HashMap1 &other);
14 HashMap1 &operator=(const HashMap1 &other);
15 ~HashMap1();
16 void resize(int newCapacity); // rehashing
17 void put(int key, int value);
18 bool remove(int key);
19 int get(int key) const;
20 int size() const;
21};
Open addressing example¶
1class Bucket {
2 int key;
3 int value;
4 bool emptySinceStart;
5 bool emptySinceRemoval;
6 public:
7 Bucket();
8 void set(int key, int value);
9 void remove();
10 bool isEmptySinceStart() const;
11 bool isEmptySinceRemoval() const;
12 bool isEmpty() const;
13 int getKey() const;
14 int getValue() const;
15};
16
17// HashMap
18// Linear probing
19class HashMap {
20 int capacity;
21 Bucket *buckets; // 1D array of buckets
22 int hashFunction(int key);
23 // probing algorithm
24 int probe(int hash, int probed);
25 public:
26 HashMap(int capacity = 100);
27 HashMap(const HashMap &other);
28 HashMap &operator=(const HashMap &other);
29 ~HashMap();
30 bool set(int key, int value);
31 bool remove(int key);
32 int lookUp(int key);
33};
Hash Table Glossary¶
- Key (hash table)¶
It is the unique identifier to locate the data stored in a hash table. The input to the hash function to calculate the index of the buckets in the hash table. The key is unique and immutable.
- Value (hash table)¶
The data associated with the key in the hash table.
- Hash function (hash table)¶
A special type of hash function used by hash table. It maps a key of an arbitrary type to an index to be used to refer to a bucket in the hash table.
- Bucket (hash table)¶
The storage location in the hash table where the value is stored. It is usually an array element.
- Collision (hash table)¶
A collision occurs when multiple keys result in the same index in the array. It is a common issue to be solved routinely in hash tables.
- Collision resolution (hash table)¶
The approach to resolve collisions in hash tables. It is a critical part of the hash table implementation. There are two main approaches: chaining and open addressing.
- Load factor (hash table)¶
The ratio of the number of keys to the number of buckets in the hash table. It is used to measure the health of a hash table and determine when to resize the hash table.
- Rehashing (hash table)¶
The process of resizing the hash table to a new size when the load factor exceeds a certain threshold. It is used to maintain the performance of the hash table.
- Chaining¶
A collision resolution approach in which every element in an array is a variable-length list of elements.
- Open addressing¶
A collision resolution approach in which an empty bucket elsewhere in the hash table is used when collision happens
- Probing¶
The mechanism to determine the next index of bucket to check in the open addressing collision resolution approach
- Hash map¶
An implementation of the Map ADT based on the hash table data structure
- Hash set¶
An implementation of the Set ADT based on the hash table data structure