Database Indexing
Indexes are the single biggest lever between a query that returns in 2 ms and one that melts the server. Every serious system-design interview round touches them — sometimes explicitly (“what index would you put on this?”), more often implicitly (“why is the dashboard slow?”). This post is my working cheat sheet for the five index types worth knowing, when each wins, and the optimization patterns that separate a junior answer from a staff-level one.
How to use this: read the tables first, skim the diagrams, come back for the prose. The goal isn’t to memorize every structure — it’s to pick the right one for a scenario and explain the trade-off in one sentence.
1. Why indexes exist (and what they cost)
Without an index, a query like WHERE email = 'alice@example.com' forces the database to sequentially scan every row until it hits a match. At 10 rows this is fine; at 10 million rows it’s fatal. An index is a separate, query-friendly data structure that tells the database exactly which pages on disk contain the record — like a book’s index that maps terms to page numbers.
flowchart LR
Q["<img src="/icons/mdi-database-search.svg" alt="mdi:database-search" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SELECT<br/><span class='sub'>WHERE email = ?</span>"]:::ask
subgraph NOIDX ["No index — sequential scan"]
direction TB
S1["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 1<br/><span class='sub'>1,000 rows</span>"]:::bad
S2["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 2<br/><span class='sub'>1,000 rows</span>"]:::bad
S3["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> … ×N pages<br/><span class='sub'>scan every one</span>"]:::bad
SN["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> match?<br/><span class='sub'>maybe, eventually</span>"]:::bad
S1 --> S2 --> S3 --> SN
end
subgraph IDX ["With index — targeted read"]
direction TB
I1["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree lookup<br/><span class='sub'>2–3 reads</span>"]:::ok
I2["<img src="/icons/mdi-file-check-outline.svg" alt="mdi:file-check-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 42<br/><span class='sub'>1 row match</span>"]:::ok
I1 --> I2
end
Q ==>|"no index"| NOIDX
Q ==>|"indexed"| IDX
classDef clusterRed fill:transparent,stroke:#dc2626,stroke-dasharray:6 4
classDef clusterGreen fill:transparent,stroke:#16a34a,stroke-dasharray:6 4
class NOIDX clusterRed
class IDX clusterGreen
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef bad stroke:#dc2626,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
The cost of indexes — easy to forget, but always asked:
- Disk space — each index is a second copy of the indexed data (sometimes the whole row).
- Write amplification — every
INSERT/UPDATEtouches the main table and every index. Five indexes on a hot-write table means each insert does six disk writes. - Not free under low volume — for a table with 500 rows, the index traversal cost can exceed a full scan.
When indexes hurt more than help: a logging table that’s 99% writes, a tiny configuration table, or a staging table rewritten every night. The rule of thumb is: index for read patterns you can name; leave everything else unindexed until profiling says otherwise.
2. How indexes actually work on disk
The main table data is written to a heap file — rows in insert order, no sort. When the database needs one row, it can’t know which page that row sits on without help. An index solves this by maintaining a separate sorted structure that maps from a key to a page pointer (or an offset inside a page).
Two facts that drive every index decision:
- Disk is always slower than RAM, even on modern NVMe SSDs. Random I/O is 10–100× slower than sequential I/O. Indexes exist largely to turn random scans into sequential reads + a small number of targeted page fetches.
- Databases read and write in fixed-size pages (typically 8 KB in Postgres, 16 KB in MySQL/InnoDB). Index structures are shaped to match these page boundaries — which is why B-trees use “fat” nodes with hundreds of keys instead of binary branching.
A good mental model: an index is a contract with the query planner. You promise the database that you will spend disk + write cycles maintaining this structure; in return, it promises to use it to avoid scanning millions of rows.
3. B-tree — the default for a reason
B-trees are the workhorse index. Postgres uses them for every CREATE INDEX unless you say otherwise; MySQL’s InnoDB stores the whole table as a clustered B-tree on the primary key; MongoDB’s default index is a B+ tree. If you pick a B-tree in an interview and can defend why, you’re rarely wrong.
What makes a B-tree a B-tree:
- Self-balancing — the tree grows evenly, not lopsided, so every lookup has the same predictable depth.
- High fan-out — each node holds hundreds of keys + pointers, keeping the tree shallow. For 1 billion rows you typically need only ~4 levels.
- Leaves sit at one depth — and in B+ trees (the common variant), leaves are linked in a doubly-linked list so range scans walk sideways instead of re-traversing the tree.
flowchart TB
R["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Root<br/><span class='sub'>[50, 100]</span>"]:::highlight
B1["Internal<br/><span class='sub'>[10, 25, 40]</span>"]:::step
B2["Internal<br/><span class='sub'>[60, 75, 90]</span>"]:::step
B3["Internal<br/><span class='sub'>[110, 150, 200]</span>"]:::step
R --> B1
R --> B2
R --> B3
L1["Leaf [1-25]"]:::ok
L2["Leaf [26-50]"]:::ok
L3["Leaf [51-90]"]:::ok
L4["Leaf [91-200]"]:::ok
B1 --> L1
B1 --> L2
B2 --> L3
B3 --> L4
L1 -.->|"next"| L2 -.->|"next"| L3 -.->|"next"| L4
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
A lookup for id = 55 reads the root → finds the pointer for the 50-100 child → reads that internal node → follows the pointer for 51-60 → reads that leaf. Three disk pages for a billion-row table.
In Postgres a plain CREATE TABLE users (id SERIAL PRIMARY KEY, email VARCHAR(255) UNIQUE) creates two B-trees immediately: one for the PK, one for the unique constraint. This is also what makes WHERE email = ? fast without any extra work.
Why B-trees win as the default:
- Equality and range queries —
age > 25andBETWEEN 10 AND 20both use the index. - Cheap
ORDER BYon indexed columns — the tree is already sorted. - Stable performance under random inserts/deletes (the self-balancing part).
- Reach for anything else only when you have a specific reason.
4. LSM tree — when writes dominate
B-trees do one in-place write per update. For a system ingesting 100k writes per second — a metrics platform, a log collector, a Kafka-fed dashboard — that’s 100k random disk seeks per second, which saturates any SSD.
Log-Structured Merge trees solve this by never updating in place. They buffer everything in memory, write sorted chunks sequentially, and merge them in the background.
Write path
flowchart LR
W["<img src="/icons/mdi-database-plus.svg" alt="mdi:database-plus" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> INSERT / UPDATE<br/><span class='sub'>incoming write</span>"]:::step
WAL["<img src="/icons/mdi-file-document-edit-outline.svg" alt="mdi:file-document-edit-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> WAL<br/><span class='sub'>sequential append</span>"]:::cache
MEM["<img src="/icons/mdi-memory.svg" alt="mdi:memory" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Memtable<br/><span class='sub'>in-RAM sorted tree</span>"]:::cache
FLUSH["<img src="/icons/mdi-download.svg" alt="mdi:download" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Flush<br/><span class='sub'>~MB threshold</span>"]:::step
L0["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L0<br/><span class='sub'>immutable</span>"]:::storage
L1["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L1<br/><span class='sub'>compacted, merged</span>"]:::storage
L2["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L2+<br/><span class='sub'>older, larger</span>"]:::storage
W ==> WAL
W ==> MEM
MEM ==>|"when full"| FLUSH
FLUSH --> L0
L0 -.->|"background compaction"| L1
L1 -.->|"background compaction"| L2
classDef cache stroke:#10b981,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
- Memtable — the in-RAM sorted buffer (typically a skiplist or red-black tree). Writes land here instantly.
- WAL (Write-Ahead Log) — a single append-only file for crash recovery. If the process dies, the WAL replays into a fresh memtable.
- SSTables — once the memtable is big enough (a few MB), it’s flushed to an immutable sorted file on disk. One sequential write of many MB instead of many random 8 KB writes.
- Compaction — a background job merges older SSTables, dropping tombstones for deleted keys and duplicate keys.
Read path — and why reads are harder
A single point lookup may have to check many SSTables because any of them could hold the most recent value for a key.
flowchart TB
Q["<img src="/icons/mdi-database-search.svg" alt="mdi:database-search" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> GET key = X<br/><span class='sub'>point lookup</span>"]:::ask
M["<img src="/icons/mdi-memory.svg" alt="mdi:memory" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Memtable<br/><span class='sub'>RAM</span>"]:::cache
BF1{{"Bloom filter<br/><span class='sub'>L0 SSTable</span>"}}:::ask
L0F["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> L0 SSTable<br/><span class='sub'>disk read</span>"]:::storage
BF2{{"Bloom filter<br/><span class='sub'>L1 SSTable</span>"}}:::ask
L1F["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> L1 SSTable<br/><span class='sub'>disk read</span>"]:::storage
DONE["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> found<br/><span class='sub'>return value</span>"]:::ok
MISS["<img src="/icons/mdi-close-circle-outline.svg" alt="mdi:close-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> not in table<br/><span class='sub'>skip</span>"]:::bad
Q --> M
M ==>|"hit"| DONE
M -->|"miss"| BF1
BF1 -->|"definitely not"| MISS
BF1 -->|"maybe"| L0F
L0F ==>|"hit"| DONE
L0F -->|"miss"| BF2
BF2 -->|"definitely not"| MISS
BF2 -->|"maybe"| L1F
L1F ==>|"hit"| DONE
classDef cache stroke:#10b981,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef bad stroke:#dc2626,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Bloom filters are the secret. Each SSTable ships with a tiny probabilistic filter that can say “this key is definitely NOT here” in a single hash operation. A bloom filter miss skips the SSTable entirely; only a positive (or “maybe”) answer triggers an actual disk read. In practice this keeps point-lookup reads at 1–2 disk fetches even with hundreds of SSTables.
Real-world LSM users:
- Cassandra — Netflix’s viewing-events storage.
- RocksDB — the storage engine under Kafka Streams, Flink state, many Facebook services.
- DynamoDB — publicly described as LSM-style internals.
- ScyllaDB — a Cassandra-compatible rewrite in C++.
When to pick LSM in an interview: write-heavy workloads where reads are rare or latency-tolerant — metrics, logs, event streams, IoT telemetry. The moment a read-heavy user-facing query enters the picture, B-trees usually win on tail latency.
5. Hash index — O(1) exact match, nothing else
A hash index is a persistent hashmap: hash(key) → bucket → row pointer. Exact-match lookups are constant-time; range queries, sorting, and prefix matching are impossible by design, because similar keys land in different buckets on purpose.
hash("alice@example.com") → bucket 42 → page 7, row 3hash("bob@example.com") → bucket 17 → page 12, row 0Why you rarely see them:
- Postgres supports them (
USING hash), doesn’t recommend them — B-trees handle equality almost as fast and give you range queries for free. - Only production-common place: in-memory stores. Redis’s top-level key→value map is a hash table. MySQL’s
MEMORYengine defaults to hash indexes. - “Hash indexes solve a problem we rarely have” — Bruce Momjian, PostgreSQL core dev.
Interview rule: mention them as an option, but don’t linger. Choosing a hash index usually signals you haven’t considered what else the query might need six months from now.
6. Geospatial — when 2D kills your B-tree
Proximity search — “restaurants within 5 km of me” — breaks B-trees. Two 1D indexes on latitude and longitude produce a rectangular filter region that’s much larger than the actual circular search. At scale (Uber, Yelp, Tinder) this is a 10,000× efficiency difference.
Three production approaches; you only need to defend one in an interview:
Geohash — turn 2D into a sortable string
Recursively subdivide the globe into base-32 cells and encode each level as one character:
9q— Western US9q8— Bay Area9q8y— San Francisco9q8yy— Mission District9q8yyk— one city block
flowchart LR
GPS["<img src="/icons/mdi-map-marker.svg" alt="mdi:map-marker" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> GPS point<br/><span class='sub'>37.77, -122.41</span>"]:::step
ENC["<img src="/icons/mdi-cog.svg" alt="mdi:cog" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Geohash encoder<br/><span class='sub'>recursive subdivide</span>"]:::step
STR["<img src="/icons/mdi-format-text.svg" alt="mdi:format-text" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> String<br/><span class='sub'>9q8yyk</span>"]:::step
BT["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree index<br/><span class='sub'>prefix range scan</span>"]:::database
NEAR["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Nearby points<br/><span class='sub'>prefix 9q8yy*</span>"]:::ok
GPS --> ENC --> STR --> BT ==> NEAR
classDef database stroke:#f97316,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
The elegance: close-in-reality locations share long common prefixes. A B-tree range scan on geohash LIKE '9q8yy%' returns all points in that cell. You get proximity search using nothing but a string column and a B-tree.
Redis’s geospatial commands (GEOADD, GEOSEARCH) use geohash internally. This is also the easiest to explain in an interview.
Quadtree — recursive 2D partitioning
Divide space into 4 quadrants; when a quadrant has more than a threshold of points, split it again. Adaptive — dense areas get deep subdivision, empty ocean stays shallow. Historically important; most modern databases don’t ship quadtrees but the idea shows up in map-tile systems (Google Maps tiles are a quadtree).
R-tree — production standard
Groups nearby objects into overlapping bounding rectangles that adapt to the data shape. Handles points and polygons in the same index (useful for “find delivery zones that cover this address”). PostGIS, MySQL’s SPATIAL INDEX, and SQLite R*Tree all use R-trees.
How to answer this in an interview: “A B-tree can’t serve proximity queries because lat/lon are independent dimensions. I’d reach for geohash if I can stay on Postgres without extensions — it turns 2D into string prefix matching so a plain B-tree works. For production GIS workloads I’d use PostGIS’s R-tree, which handles both points and polygons and is what Uber, Yelp, and most delivery apps actually run on.” That one sentence covers the whole topic.
7. Inverted index — full-text search
A B-tree can find an exact email, a range of timestamps, and a prefix. It cannot find the word "database" anywhere inside a TEXT column. WHERE content LIKE '%database%' is an unindexable full scan.
An inverted index flips the relationship:
Documents: doc1: "B-trees are fast and reliable" doc2: "Hash tables are fast but limited" doc3: "B-trees handle range queries well"
Inverted index: "b-trees" → [doc1, doc3] "fast" → [doc1, doc2] "hash" → [doc2] "range" → [doc3]Production inverted indexes (Elasticsearch, Lucene, Postgres tsvector) run each document through an analysis pipeline before storing:
flowchart LR
DOC["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Raw text<br/><span class='sub'>The Databases Are Fast</span>"]:::step
TOK["<img src="/icons/mdi-scissors-cutting.svg" alt="mdi:scissors-cutting" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Tokenize<br/><span class='sub'>[The, Databases, …]</span>"]:::step
LOWER["<img src="/icons/mdi-format-letter-case-lower.svg" alt="mdi:format-letter-case-lower" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Lowercase<br/><span class='sub'>[the, databases, …]</span>"]:::step
STOP["<img src="/icons/mdi-filter-remove-outline.svg" alt="mdi:filter-remove-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Drop stop words<br/><span class='sub'>[databases, fast]</span>"]:::step
STEM["<img src="/icons/mdi-content-cut.svg" alt="mdi:content-cut" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Stem<br/><span class='sub'>[databas, fast]</span>"]:::step
MAP["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Inverted map<br/><span class='sub'>databas → [doc1]</span>"]:::ok
DOC --> TOK --> LOWER --> STOP --> STEM ==> MAP
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
This is why a search for "Databases" matches database, databases, and database's — the analyzer collapses all of them to the stem databas. Elasticsearch layers relevance scoring (BM25 / TF-IDF), fuzzy matching, and phrase queries on top.
Production cost: inverted indexes are big (sometimes larger than the source corpus) and expensive to update — one document changing means touching every term it contains. That’s why Elasticsearch refresh intervals default to 1 second, not instant.
Where you’ll meet them in production: GitHub code search, Slack message history, any “search bar” on a content site, log search (OpenSearch / ELK), product catalogs. Even Postgres ships a usable inverted index via CREATE INDEX ... USING GIN(to_tsvector(...)).
8. Optimization patterns — composite + covering
Once you know which type of index to use, there are two optimization patterns interviewers love to probe.
Composite index — one B-tree, multiple columns
A classic feed query:
SELECT * FROM postsWHERE user_id = 123 AND created_at > '2026-01-01'ORDER BY created_at DESC;Two single-column indexes force the planner to fetch two big sets and intersect them, then sort. A single composite index gives both filtering and sorting in one B-tree walk:
CREATE INDEX idx_user_time ON posts(user_id, created_at);The index’s keys are tuples — sorted by user_id first, then created_at within each user:
flowchart TD
ROOT["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Composite B-tree<br/><span class='sub'>(user_id, created_at)</span>"]:::database
K1["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-01)"]:::step
K2["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-02)"]:::step
K3["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-03)"]:::step
K4["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (2, 2026-01-01)"]:::step
K5["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (3, 2026-01-01)"]:::step
ROOT --> K1
ROOT --> K2
ROOT --> K3
ROOT --> K4
ROOT --> K5
K3 ==>|"user_id=1 AND created_at > '2026-01-01'<br/>range-scan THIS subtree only"| HIT["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> match"]:::highlight
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef database stroke:#f97316,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Column order rules of thumb:
- Equality first, range second, sort last.
(user_id, created_at)works;(created_at, user_id)doesn’t help the common query pattern. - The index is only useful for prefixes of the column list.
(a, b, c)acceleratesWHERE a = ?,WHERE a = ? AND b = ?,WHERE a = ? AND b = ? AND c = ?. It does not accelerateWHERE b = ?. - “Most selective column first” is a good default, but query patterns trump selectivity — if you always sort by
created_at, it must be in the index even if it’s not selective.
Covering index — answer without touching the table
Normal index lookup = (1) find the row in the index, (2) fetch the actual row from the heap for any column the index doesn’t carry. For read-heavy queries returning just a few columns, those heap fetches are wasted work.
A covering index includes the extra columns directly:
-- Normal: index on (user_id, created_at); "likes" requires a heap readCREATE INDEX idx_user_time ON posts(user_id, created_at);
-- Covering: "likes" travels with the index; query is answered purely from the indexCREATE INDEX idx_user_time_likes ON posts(user_id, created_at) INCLUDE (likes);Trade-off: the index is bigger (more storage, slower writes) but the query is an order of magnitude faster because it skips heap page reads. Reach for this when one hot query dominates your table and returns few columns — activity feeds, leaderboards, unread-count endpoints. Don’t sprinkle INCLUDE everywhere; modern query planners are good with plain indexes and the maintenance cost is real.
Write-side optimizations — shrink what you maintain
Every index costs write throughput. The goal here is to maintain less index per write without losing read speed.
Partial indexes — index only the rows a hot query actually touches. Classic case: an orders table where 99% of rows are status = 'closed' but every query filters on status = 'open':
-- Instead of a full index on status (bloated, rarely useful):CREATE INDEX idx_orders_status ON orders(status);
-- Index only the 1% that matters — drop size + write cost by ~100x:CREATE INDEX idx_open_orders ON orders(user_id, created_at) WHERE status = 'open';Inserts that land on closed orders skip this index entirely. The index stays tiny (1% of rows) and SELECT … WHERE status = 'open' AND user_id = ? uses it.
Fill factor + HOT updates (Postgres) — Postgres’s Heap-Only Tuple optimization lets an UPDATE re-write the row in the same page without touching any index, as long as no indexed column changes and the page has free space. Dropping fill factor leaves that free space:
ALTER TABLE posts SET (fillfactor = 85); -- 15% free per pageBiggest wins on hot-updated tables (user sessions, counters) where the updated columns are NOT indexed.
CREATE INDEX CONCURRENTLY — a regular CREATE INDEX takes an exclusive write lock. On a production table with 100 M rows, that’s minutes of downtime. The concurrent variant builds in the background:
CREATE INDEX CONCURRENTLY idx_posts_user_time ON posts(user_id, created_at);Slower overall, but no write lock. Non-negotiable for production schema changes.
Drop + recreate for bulk loads — if you’re importing 100 M rows from a dump, every existing index pays the write tax row-by-row. Drop the indexes, COPY the data, recreate the indexes once at the end. Typically 3–10× faster.
UNLOGGED / async commit for non-critical tables — staging tables, session stores, and rate-limit counters don’t need WAL durability. UNLOGGED tables skip the WAL and synchronous_commit = off lets the commit return before the WAL fsync. Accept the “lose the last second on crash” trade-off.
Read-side optimizations — answer more queries from less I/O
Functional (expression) indexes — if you query a computed value, index the computation, not the raw column:
-- WHERE lower(email) = ? — a regular index on email is NOT usedCREATE INDEX idx_users_email_lower ON users(lower(email));
-- WHERE (data->>'status') = 'active' — JSONB path indexCREATE INDEX idx_orders_status_jsonb ON orders((data->>'status'));A common interview gotcha: candidates add an index on created_at but the query is WHERE DATE(created_at) = CURRENT_DATE. The regular index is useless; you need INDEX ON (DATE(created_at)).
Bitmap index scans (Postgres) — when a query filters on several columns that each have their own single-column index, Postgres can scan each index, build a bitmap of matching pages, AND/OR the bitmaps, then fetch only pages in the result set. Lets you avoid a composite index when queries mix columns unpredictably:
-- Rather than one composite for every combination, keep single-column indexes:CREATE INDEX idx_status ON orders(status);CREATE INDEX idx_customer ON orders(customer_id);-- Postgres bitmap-AND-merges these automatically for-- WHERE status = 'open' AND customer_id = 42Don’t rely on this for the hottest query — a targeted composite index is still faster — but it’s the right tool when query shapes are ad-hoc and many.
Index-only scans + visibility map — a covering index is “only as covering as the visibility map allows.” Postgres has to check whether a row is visible to the current transaction; for ancient stable rows the visibility map says “all visible, trust the index.” For recently-updated rows it falls back to a heap fetch. Running VACUUM (or autovacuum tuned aggressively) updates the visibility map and keeps index-only scans actually … index-only.
Materialized views for expensive aggregations — if a dashboard recomputes SUM(amount) GROUP BY region, day from 10 B rows on every render, no index helps. Precompute:
CREATE MATERIALIZED VIEW sales_daily ASSELECT region, date_trunc('day', created_at) AS day, SUM(amount) AS totalFROM orders GROUP BY 1, 2;
CREATE UNIQUE INDEX ON sales_daily(region, day);REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily; -- nightly cronYou’ve traded freshness for read latency — the right move for reporting and analytics endpoints.
Balancing the trade-off
A realistic rule of thumb for the maintenance cost of an index:
Every index pays on every write. That’s why staff-level answers always name the write side when proposing a new index.
flowchart TD
Q["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> New query arrives"]:::ask
RATE{{"Read rate ≫ write rate?"}}:::ask
HOT{{"Always filters same subset?"}}:::ask
SORT{{"Sorts / orders by a column?"}}:::ask
AGG{{"Aggregates over many rows?"}}:::ask
COMP["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Composite index<br/><span class='sub'>equality → range → sort</span>"]:::ok
PART["<img src="/icons/mdi-filter-outline.svg" alt="mdi:filter-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Partial index<br/><span class='sub'>WHERE predicate</span>"]:::ok
COVER["<img src="/icons/mdi-table-column.svg" alt="mdi:table-column" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Covering index<br/><span class='sub'>INCLUDE columns</span>"]:::ok
MV["<img src="/icons/mdi-table-eye.svg" alt="mdi:table-eye" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Materialized view<br/><span class='sub'>+ index the view</span>"]:::ok
NONE["<img src="/icons/mdi-close-circle-outline.svg" alt="mdi:close-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Leave it; seq-scan<br/><span class='sub'>writes too hot</span>"]:::warn
Q --> RATE
RATE -->|"yes"| HOT
RATE -->|"no — writes dominate"| NONE
HOT -->|"yes"| PART
HOT -->|"no"| SORT
SORT -->|"yes"| COMP
SORT -->|"no"| AGG
AGG -->|"yes"| MV
AGG -->|"no — few columns"| COVER
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Finding indexes you don’t actually need — Postgres keeps counters for every index:
-- Indexes that have never been used since the last resetSELECT schemaname, relname AS table, indexrelname AS index, idx_scanFROM pg_stat_user_indexesWHERE idx_scan = 0ORDER BY pg_relation_size(indexrelid) DESC;
-- Queries that do full table scans worth fixingSELECT query, calls, total_exec_timeFROM pg_stat_statementsWHERE query !~* 'pg_'ORDER BY total_exec_time DESCLIMIT 20;Production heuristic: drop any index with idx_scan = 0 after 30 days of traffic (except the ones only used by rare cron jobs — tag those). Every dropped index is free write throughput.
9. Cheat sheet — which index for what
flowchart TD
Q["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> What is the query?"]:::ask
EQ{{"Exact match on a column?"}}:::ask
RANGE{{"Range / sort on a column?"}}:::ask
GEO{{"Proximity on (lat, lon)?"}}:::ask
TEXT{{"Substring inside TEXT?"}}:::ask
WRITE{{"Writes 10× reads?"}}:::ask
BTREE["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree<br/><span class='sub'>your default</span>"]:::highlight
LSM["<img src="/icons/mdi-database-arrow-down.svg" alt="mdi:database-arrow-down" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> LSM tree<br/><span class='sub'>time-series, logs</span>"]:::ok
HASH["<img src="/icons/mdi-pound.svg" alt="mdi:pound" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Hash<br/><span class='sub'>in-memory, no range</span>"]:::warn
GEOHASH["<img src="/icons/mdi-map-marker.svg" alt="mdi:map-marker" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Geohash<br/><span class='sub'>string + B-tree</span>"]:::ok
RTREE["<img src="/icons/mdi-vector-square.svg" alt="mdi:vector-square" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> R-tree / PostGIS<br/><span class='sub'>points + polygons</span>"]:::ok
INV["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Inverted index<br/><span class='sub'>tsvector, search</span>"]:::ok
Q --> EQ
Q --> RANGE
Q --> GEO
Q --> TEXT
Q --> WRITE
EQ ==>|"yes"| BTREE
EQ -->|"in-RAM, no sort"| HASH
RANGE ==> BTREE
GEO -->|"Postgres / Redis"| GEOHASH
GEO -->|"production GIS"| RTREE
TEXT --> INV
WRITE --> LSM
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
One-sentence summary per type
| Type | Pick when | Don’t pick when |
|---|---|---|
| B-tree | You need equality + range + sort on disk | Never wrong; default choice |
| LSM tree | Writes ≫ reads, time-series, append-heavy | User-facing latency-critical reads |
| Hash | In-memory, exact match, no range ever | Anywhere you might later need sort / range |
| Geohash | Proximity search on plain Postgres / Redis | You need polygon queries |
| R-tree / PostGIS | Production GIS — points + shapes | You only have points and no extension |
| Inverted | Substring or relevance search in TEXT | Small text columns where LIKE '%x%' is cheap |
Pitfalls to volunteer
- Over-indexing. Each index slows every write on the table; audit unused ones via
pg_stat_user_indexesin Postgres. - Leftmost-prefix rule.
(a, b, c)index does not helpWHERE b = ?. A composite index is an ordered commitment. - Indexes on low-cardinality columns are mostly useless. A B-tree on
statuswith three values (pending,active,done) rarely beats a seq scan. - Functional indexes. If you query
LOWER(email) = ?, you needCREATE INDEX ON users (LOWER(email))— a regular index onemailwon’t be used. NULLsemantics. Most indexes do store NULLs (Postgres, MySQL B-tree) but some operators won’t use them;WHERE col IS NOT NULLusually bypasses the index unless you have a partial index.- Partial indexes.
CREATE INDEX ON orders (user_id) WHERE status = 'open'keeps the index small when the hot query only looks at a sliver of the table — Postgres specialty. - Writes during read queries. Long-running analytical reads on an LSM-heavy table can pin old SSTables on disk; compaction can’t reclaim space. Watch tombstone accumulation in Cassandra.
- Index-only scans. Postgres only uses them if the visibility map says the page is all-visible. A covering index that looks covering still hits the heap on recently-updated rows. Vacuum matters.
Indexes are a single abstraction with many implementations, each tuned to a different workload shape. You’ll be fine in 95% of interviews by answering B-tree and defending it; the other 5% is recognizing the shape of the problem — heavy writes (LSM), proximity (geohash/R-tree), substring (inverted) — and reaching for the right tool. Everything else is optimization.