Hash maps are one of the most commonly used data structures in programming. They power everything from language interpreters to caching systems — but because they’re built into most languages, we rarely stop to think about how they actually work.
Depending on the language, they go by different names:
- Python: dict
- JavaScript: Object (or Map)
- Java: HashMap
- Go: map
- Rust: HashMap
- C++: std::unordered_map
Despite the naming differences, the core idea is the same:
a way to associate keys with values and retrieve them efficiently.
In this post, we’ll build up the concept of a hash map from scratch — looking at what a hash function does, how data is stored, and how operations like insert, get, and resize work internally.
Hash Function
The typical signature of a hash function looks like this:
hf(K) -> int

Where K is any key type supported by the hash function. The function maps keys to integers, ideally distributing them uniformly over a target range.
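As a concrete (if toy) example, here is a minimal string hasher. The name string_hasher is introduced here purely for illustration; real hash functions, like Python's built-in hash, are considerably more sophisticated:

def string_hasher(key: str) -> int:
    # A toy polynomial hash: deterministic, but not production-grade.
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)  # 31 is a common small prime multiplier
    return h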
Hash Map
A simple hash map implementation needs at least:
- A backing data structure (usually an array)
- An integer capacity
- A hash function
- An integer length tracking the number of entries
Before diving into an implementation, let’s go over some terminology using this Python example:
- K and V: two generic type variables representing the key and value types
- Hasher: a function that takes a K and returns an int, represented as Callable[[K], int]
- Entry: a key–value pair, represented as tuple[K, V]
- Bucket: a list of entries, represented as list[Entry]
from typing import Callable, Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

Hasher = Callable[[K], int]
Entry = tuple[K, V]
Bucket = list[Entry]

class HashMap(Generic[K, V]):
    def __init__(self, capacity: int, hasher: Hasher) -> None:
        self.len = 0
        self.capacity = capacity
        self.hasher = hasher
        self.data: list[Bucket] = [[] for _ in range(self.capacity)]

For the data structure, we use separate chaining, which means each array slot holds a list (or bucket) of key–value pairs:
Array<Array<(K, V)>>

So the outer array holds buckets, and each bucket holds key–value tuples.
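To make the chaining concrete, here is a sketch of what data might look like after two colliding inserts (the hash values and entries are made up for illustration):

# Suppose capacity = 3 and our hasher gives hf("ab") = 4 and hf("cd") = 7.
# Both keys map to the same index, since 4 % 3 == 1 and 7 % 3 == 1,
# so the two entries end up chained in the same bucket:
data = [
    [],                        # bucket 0
    [("ab", 10), ("cd", 20)],  # bucket 1 holds both colliding entries
    [],                        # bucket 2
]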
Insert
insert(k: K, v: V)

To insert a key–value pair:
- Hash the key: hf(k)
- Convert it to an index: index = hf(k) % capacity
- Scan the bucket at data[index] to check if the key already exists:
  - If it does, update the value.
  - If not, append the new pair.
def insert(self, k: K, v: V) -> None:
    index = self.hasher(k) % self.capacity
    bucket = self.data[index]
    for i, (key, _) in enumerate(bucket):
        if key == k:
            bucket[i] = (k, v)  # Update existing key
            return
    bucket.append((k, v))  # Insert new pair
    self.len += 1

Get
get(k: K) -> Option<V>

To retrieve a value:
- Hash the key: hf(k)
- Convert it to an index: index = hf(k) % capacity
- Scan the bucket at data[index]:
  - If the key is found, return the associated value.
  - If not, return None.
def get(self, k: K) -> Optional[V]:
    index = self.hasher(k) % self.capacity
    bucket = self.data[index]
    for key, value in bucket:
        if key == k:
            return value
    return None
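Putting insert and get together, here is a quick usage sketch (reusing the toy string_hasher from earlier):

m: HashMap[str, int] = HashMap(capacity=8, hasher=string_hasher)
m.insert("apple", 1)
m.insert("banana", 2)
m.insert("apple", 3)    # same key: updates the value in place

print(m.get("apple"))   # 3
print(m.get("banana"))  # 2
print(m.get("cherry"))  # None (key was never inserted)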
Resizing
If the capacity is static and we keep inserting key–value pairs, the buckets will grow longer, and both insertion and retrieval will degrade to linear time: O(n). To avoid this, we implement resizing.
Resizing is usually triggered when the load factor exceeds a threshold (commonly 0.7 or 70%):
load_factor = length / capacity

If this value crosses the threshold, we:
- Create a new data array with larger capacity (typically 2× the current one)
- Rehash all existing keys
- Insert them into their new positions
This is necessary because the hash index depends on the capacity:
k = "some key"
hf(k) = 10
// Before resize
capacity = 3
index = 10 % 3 = 1
data = [ [ ], [("some key", 123)], [ ] ]
0 1 2
// After resize
capacity = 20
index = 10 % 20 = 10
data = [ [ ], ..., [ ], [("some key", 123)], [ ], ..., [ ] ]
0 ... 9 10 11 ... 19Even keys that were previously in the same bucket might now be distributed across different ones. For example, keys with hashes 0, 3, and 6 would all end up in bucket 0 when capacity = 3, but would land in buckets 0, 3, and 6 respectively after resizing to capacity = 10.
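Here is one way the resize logic might look as a method on our HashMap class. This is only a sketch: the _resize name and the constant are my choices (the threshold of 0.7 is the common default mentioned above), and you would call self._resize() at the end of insert so growth happens automatically:

LOAD_FACTOR_THRESHOLD = 0.7  # assumed threshold; 0.7 is a common default

def _resize(self) -> None:
    # Only grow once the load factor crosses the threshold
    if self.len / self.capacity <= LOAD_FACTOR_THRESHOLD:
        return
    old_data = self.data
    self.capacity *= 2  # typically double the capacity
    self.data = [[] for _ in range(self.capacity)]
    # Rehash every existing entry into its new bucket,
    # since the index depends on the new capacity
    for bucket in old_data:
        for k, v in bucket:
            index = self.hasher(k) % self.capacity
            self.data[index].append((k, v))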
Rehashing is essential to preserve the integrity and performance of the hash map.
Conclusion
Hash maps are deceptively simple on the surface — just keys and values, right? But once you peel back the layers, you start to see the careful balance between hashing, indexing, collision resolution, and resizing. These internals are what make dictionary lookups feel instant in high-level code.
Building a hash map from scratch is a great way to deepen your understanding of algorithmic trade-offs, memory layout, and the performance costs of seemingly simple operations. It’s one of those “build slow things” exercises that leaves you thinking faster.
In future posts, I might explore variations like open addressing, linear probing, or even hash maps optimized for performance-critical systems.
Thanks for reading!
If you’ve built your own hash map before — or hit unexpected edge cases — I’d love to hear about it.
Changelog
2025-07-11
- Added changelog
- Added Python code examples for the insert and get methods