Hash maps are one of the most commonly used data structures in programming. They power everything from language interpreters to caching systems — but because they’re built into most languages, we rarely stop to think about how they actually work.

Depending on the language, they go by different names:

  • Python: dict
  • JavaScript: Object (or Map)
  • Java: HashMap
  • Go: map
  • Rust: HashMap
  • C++: std::unordered_map

Despite the naming differences, the core idea is the same:
a way to associate keys with values and retrieve them efficiently.

In this post, we’ll build up the concept of a hash map from scratch — looking at what a hash function does, how data is stored, and how operations like insert, get, and resize work internally.


Hash Function

The typical signature of a hash function looks like this:

hf(K) -> int

Where K is any key type supported by the hash function. The function maps keys to integers, ideally distributing them uniformly over a target range.
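
To make this concrete, here is a toy hasher for string keys, a variant of the classic djb2 algorithm. The seed and multiplier are the traditional djb2 constants, but the function itself is purely illustrative; production hash functions (like Python's built-in hash) are considerably more sophisticated:

def hf(key: str) -> int:
    # djb2-style: start from a seed and mix in each character
    h = 5381
    for ch in key:
        h = h * 33 + ord(ch)
    return h

hf("some key")  # A large integer, later reduced to an index via % capacity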


Hash Map

A simple hash map implementation needs at least:

  • A backing data structure (usually an array)
  • An integer capacity
  • A hash function
  • An integer length tracking the number of entries

Before diving into an implementation, let’s go over some terminology using this Python example:

  • K and V: Two generic type variables representing the key and value types
  • Hasher: A function that takes a K and returns an int, represented as Callable[[K], int]
  • Entry: A key–value pair, represented as tuple[K, V]
  • Bucket: A list of entries, represented as list[Entry]

from typing import Callable, Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")
Hasher = Callable[[K], int]
Entry = tuple[K, V]
Bucket = list[Entry]
 
class HashMap(Generic[K, V]):
    def __init__(self, capacity: int, hasher: Hasher) -> None:
        self.len = 0
        self.capacity = capacity
        self.hasher = hasher
        self.data: list[Bucket] = [[] for _ in range(self.capacity)]

For the data structure, we use separate chaining, which means each array slot holds a list (or bucket) of key–value pairs:

Array<Array<(K, V)>>

So the outer array holds buckets, and each bucket holds key–value tuples.


Insert

insert(k: K, v: V)

To insert a key–value pair:

  1. Hash the key: hf(k)
  2. Convert it to an index: index = hf(k) % capacity
  3. Scan the bucket at data[index] to check if the key already exists:
    • If it does, update the value.
    • If not, append the new pair.

def insert(self, k: K, v: V) -> None:
    index = self.hasher(k) % self.capacity
    bucket = self.data[index]
 
    for i, (key, _) in enumerate(bucket):
        if key == k:
            bucket[i] = (k, v)  # Update existing key
            return
 
    bucket.append((k, v))  # Insert new pair
    self.len += 1

Get

get(k: K) -> Option<V>

To retrieve a value:

  1. Hash the key: hf(k)
  2. Convert it to an index: index = hf(k) % capacity
  3. At data[index], scan the bucket:
    • If the key is found, return the associated value.
    • If not, return None.

def get(self, k: K) -> Optional[V]:
    index = self.hasher(k) % self.capacity
    bucket = self.data[index]
 
    for key, value in bucket:
        if key == k:
            return value
 
    return None
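
With insert and get in place, here's a quick usage sketch. It passes Python's built-in hash as the hasher, which satisfies the Callable[[K], int] contract (Python's % operator keeps the index non-negative even when hash returns a negative value):

hm: HashMap[str, int] = HashMap(capacity=8, hasher=hash)

hm.insert("apples", 3)
hm.insert("oranges", 5)
hm.insert("apples", 4)  # Updates the existing key

print(hm.get("apples"))   # 4
print(hm.get("bananas"))  # None
print(hm.len)             # 2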
 

Resizing

If the capacity is static and we keep inserting key–value pairs, the buckets will grow longer, and both insertion and retrieval will degrade to linear time, O(n). To avoid this, we implement resizing.

Resizing is usually triggered when the load factor exceeds a threshold (commonly 0.7 or 70%):

load_factor = length / capacity

If this value crosses the threshold, we:

  1. Create a new data array with larger capacity (typically 2× the current one)
  2. Rehash all existing keys
  3. Insert them into their new positions

This is necessary because the hash index depends on the capacity:

k = "some key"
hf(k) = 10
 
// Before resize
capacity = 3
index = 10 % 3 = 1
 
data = [ [ ], [("some key", 123)], [ ] ]
          0              1          2
 
// After resize
capacity = 20
index = 10 % 20 = 10
 
data = [ [ ], ..., [ ], [("some key", 123)], [  ], ..., [  ] ]
          0   ...   9             10          11   ...   19

Even keys that were previously in the same bucket might now be distributed across different ones. For example, keys with hashes 0, 3, and 6 would all end up in bucket 0 when capacity = 3, but would land in buckets 0, 3, and 6 respectively after resizing to capacity = 10.

Rehashing is essential to preserve the integrity and performance of the hash map.
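
Here is a minimal sketch of a resize step following the three steps above. The method name _resize and the doubling strategy are illustrative choices, not the only option:

def _resize(self) -> None:
    old_data = self.data

    # 1. Create a new backing array with double the capacity
    self.capacity *= 2
    self.data = [[] for _ in range(self.capacity)]

    # 2. & 3. Rehash every entry and append it to its new bucket
    for bucket in old_data:
        for key, value in bucket:
            index = self.hasher(key) % self.capacity
            self.data[index].append((key, value))

insert would then start with a load-factor check before computing the bucket index, something like if (self.len + 1) / self.capacity > 0.7: self._resize(). Note that _resize moves entries directly rather than calling insert, so self.len stays unchanged.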


Conclusion

Hash maps are deceptively simple on the surface — just keys and values, right? But once you peel back the layers, you start to see the careful balance between hashing, indexing, collision resolution, and resizing. These internals are what make dictionary lookups feel instant in high-level code.

Building a hash map from scratch is a great way to deepen your understanding of algorithmic trade-offs, memory layout, and the performance costs of seemingly simple operations. It’s one of those “build slow things” exercises that leaves you thinking faster.

In future posts, I might explore variations like open addressing, linear probing, or even hash maps optimized for performance-critical systems.

Thanks for reading!
If you’ve built your own hash map before — or hit unexpected edge cases — I’d love to hear about it.


Changelog

2025-07-11

  • Added changelog
  • Added Python code examples for insert and get methods