💭 note
3774 words
19 minutes

Database Indexing

Indexes are the single biggest lever between a query that returns in 2 ms and one that melts the server. Every serious system-design interview round touches them — sometimes explicitly (“what index would you put on this?”), more often implicitly (“why is the dashboard slow?”). This post is my working cheat sheet for the five index types worth knowing, when each wins, and the optimization patterns that separate a junior answer from a staff-level one.

How to use this: read the tables first, skim the diagrams, come back for the prose. The goal isn’t to memorize every structure — it’s to pick the right one for a scenario and explain the trade-off in one sentence.


1. Why indexes exist (and what they cost)#

Without an index, a query like WHERE email = 'alice@example.com' forces the database to sequentially scan every row until it hits a match. At 10 rows this is fine; at 10 million rows it’s fatal. An index is a separate, query-friendly data structure that tells the database exactly which pages on disk contain the record — like a book’s index that maps terms to page numbers.

Sequential scan vs indexed lookup
flowchart LR
    Q["<img src="/icons/mdi-database-search.svg" alt="mdi:database-search" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SELECT<br/><span class='sub'>WHERE email = ?</span>"]:::ask

    subgraph NOIDX ["No index — sequential scan"]
        direction TB
        S1["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 1<br/><span class='sub'>1,000 rows</span>"]:::bad
        S2["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 2<br/><span class='sub'>1,000 rows</span>"]:::bad
        S3["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> … ×N pages<br/><span class='sub'>scan every one</span>"]:::bad
        SN["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> match?<br/><span class='sub'>maybe, eventually</span>"]:::bad
        S1 --> S2 --> S3 --> SN
    end

    subgraph IDX ["With index — targeted read"]
        direction TB
        I1["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree lookup<br/><span class='sub'>2–3 reads</span>"]:::ok
        I2["<img src="/icons/mdi-file-check-outline.svg" alt="mdi:file-check-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> page 42<br/><span class='sub'>1 row match</span>"]:::ok
        I1 --> I2
    end

    Q ==>|"no index"| NOIDX
    Q ==>|"indexed"| IDX

    classDef clusterRed fill:transparent,stroke:#dc2626,stroke-dasharray:6 4
    classDef clusterGreen fill:transparent,stroke:#16a34a,stroke-dasharray:6 4
    class NOIDX clusterRed
    class IDX clusterGreen

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef bad stroke:#dc2626,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Without an index, every page must be examined. A B-tree narrows a billion-row lookup to two or three page reads.

The cost of indexes — easy to forget, but always asked:

  • Disk space — each index is a second copy of the indexed data (sometimes the whole row).
  • Write amplification — every INSERT / UPDATE touches the main table and every index. Five indexes on a hot-write table means each insert does six disk writes.
  • Not free under low volume — for a table with 500 rows, the index traversal cost can exceed a full scan.

When indexes hurt more than help: a logging table that’s 99% writes, a tiny configuration table, or a staging table rewritten every night. The rule of thumb is: index for read patterns you can name; leave everything else unindexed until profiling says otherwise.

2. How indexes actually work on disk#

The main table data is written to a heap file — rows in insert order, no sort. When the database needs one row, it can’t know which page that row sits on without help. An index solves this by maintaining a separate sorted structure that maps from a key to a page pointer (or an offset inside a page).

Two facts that drive every index decision:

  • Disk is always slower than RAM, even on modern NVMe SSDs. Random I/O is 10–100× slower than sequential I/O. Indexes exist largely to turn random scans into sequential reads + a small number of targeted page fetches.
  • Databases read and write in fixed-size pages (typically 8 KB in Postgres, 16 KB in MySQL/InnoDB). Index structures are shaped to match these page boundaries — which is why B-trees use “fat” nodes with hundreds of keys instead of binary branching.

A good mental model: an index is a contract with the query planner. You promise the database that you will spend disk + write cycles maintaining this structure; in return, it promises to use it to avoid scanning millions of rows.


3. B-tree — the default for a reason#

B-trees are the workhorse index. Postgres uses them for every CREATE INDEX unless you say otherwise; MySQL’s InnoDB stores the whole table as a clustered B-tree on the primary key; MongoDB’s default index is a B+ tree. If you pick a B-tree in an interview and can defend why, you’re rarely wrong.

What makes a B-tree a B-tree:

  • Self-balancing — the tree grows evenly, not lopsided, so every lookup has the same predictable depth.
  • High fan-out — each node holds hundreds of keys + pointers, keeping the tree shallow. For 1 billion rows you typically need only ~4 levels.
  • Leaves sit at one depth — and in B+ trees (the common variant), leaves are linked in a doubly-linked list so range scans walk sideways instead of re-traversing the tree.
B+ tree with linked leaves
flowchart TB
    R["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Root<br/><span class='sub'>[50, 100]</span>"]:::highlight

    B1["Internal<br/><span class='sub'>[10, 25, 40]</span>"]:::step
    B2["Internal<br/><span class='sub'>[60, 75, 90]</span>"]:::step
    B3["Internal<br/><span class='sub'>[110, 150, 200]</span>"]:::step

    R --> B1
    R --> B2
    R --> B3

    L1["Leaf [1-25]"]:::ok
    L2["Leaf [26-50]"]:::ok
    L3["Leaf [51-90]"]:::ok
    L4["Leaf [91-200]"]:::ok

    B1 --> L1
    B1 --> L2
    B2 --> L3
    B3 --> L4

    L1 -.->|"next"| L2 -.->|"next"| L3 -.->|"next"| L4

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Internal nodes fan out to keep the tree shallow; leaves form a doubly-linked list so range scans walk sideways instead of re-traversing the tree.

A lookup for id = 55 reads the root → finds the pointer for the 50-100 child → reads that internal node → follows the pointer for 51-60 → reads that leaf. Three disk pages for a billion-row table.

In Postgres a plain CREATE TABLE users (id SERIAL PRIMARY KEY, email VARCHAR(255) UNIQUE) creates two B-trees immediately: one for the PK, one for the unique constraint. This is also what makes WHERE email = ? fast without any extra work.

Why B-trees win as the default:

  • Equality and range queries — age > 25 and BETWEEN 10 AND 20 both use the index.
  • Cheap ORDER BY on indexed columns — the tree is already sorted.
  • Stable performance under random inserts/deletes (the self-balancing part).
  • Reach for anything else only when you have a specific reason.

4. LSM tree — when writes dominate#

B-trees do one in-place write per update. For a system ingesting 100k writes per second — a metrics platform, a log collector, a Kafka-fed dashboard — that’s 100k random disk seeks per second, which saturates any SSD.

Log-Structured Merge trees solve this by never updating in place. They buffer everything in memory, write sorted chunks sequentially, and merge them in the background.

Write path#

LSM-tree write path
flowchart LR
    W["<img src="/icons/mdi-database-plus.svg" alt="mdi:database-plus" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> INSERT / UPDATE<br/><span class='sub'>incoming write</span>"]:::step
    WAL["<img src="/icons/mdi-file-document-edit-outline.svg" alt="mdi:file-document-edit-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> WAL<br/><span class='sub'>sequential append</span>"]:::cache
    MEM["<img src="/icons/mdi-memory.svg" alt="mdi:memory" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Memtable<br/><span class='sub'>in-RAM sorted tree</span>"]:::cache
    FLUSH["<img src="/icons/mdi-download.svg" alt="mdi:download" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Flush<br/><span class='sub'>~MB threshold</span>"]:::step
    L0["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L0<br/><span class='sub'>immutable</span>"]:::storage
    L1["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L1<br/><span class='sub'>compacted, merged</span>"]:::storage
    L2["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> SSTable L2+<br/><span class='sub'>older, larger</span>"]:::storage

    W ==> WAL
    W ==> MEM
    MEM ==>|"when full"| FLUSH
    FLUSH --> L0
    L0 -.->|"background compaction"| L1
    L1 -.->|"background compaction"| L2

classDef cache stroke:#10b981,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Writes land in RAM plus the WAL instantly; the memtable flushes to an immutable SSTable and background compaction merges older levels.
  • Memtable — the in-RAM sorted buffer (typically a skiplist or red-black tree). Writes land here instantly.
  • WAL (Write-Ahead Log) — a single append-only file for crash recovery. If the process dies, the WAL replays into a fresh memtable.
  • SSTables — once the memtable is big enough (a few MB), it’s flushed to an immutable sorted file on disk. One sequential write of many MB instead of many random 8 KB writes.
  • Compaction — a background job merges older SSTables, dropping tombstones for deleted keys and duplicate keys.

Read path — and why reads are harder#

A single point lookup may have to check many SSTables because any of them could hold the most recent value for a key.

LSM-tree read path with bloom filters
flowchart TB
    Q["<img src="/icons/mdi-database-search.svg" alt="mdi:database-search" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> GET key = X<br/><span class='sub'>point lookup</span>"]:::ask
    M["<img src="/icons/mdi-memory.svg" alt="mdi:memory" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Memtable<br/><span class='sub'>RAM</span>"]:::cache
    BF1{{"Bloom filter<br/><span class='sub'>L0 SSTable</span>"}}:::ask
    L0F["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> L0 SSTable<br/><span class='sub'>disk read</span>"]:::storage
    BF2{{"Bloom filter<br/><span class='sub'>L1 SSTable</span>"}}:::ask
    L1F["<img src="/icons/mdi-database.svg" alt="mdi:database" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> L1 SSTable<br/><span class='sub'>disk read</span>"]:::storage
    DONE["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> found<br/><span class='sub'>return value</span>"]:::ok
    MISS["<img src="/icons/mdi-close-circle-outline.svg" alt="mdi:close-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> not in table<br/><span class='sub'>skip</span>"]:::bad

    Q --> M
    M ==>|"hit"| DONE
    M -->|"miss"| BF1
    BF1 -->|"definitely not"| MISS
    BF1 -->|"maybe"| L0F
    L0F ==>|"hit"| DONE
    L0F -->|"miss"| BF2
    BF2 -->|"definitely not"| MISS
    BF2 -->|"maybe"| L1F
    L1F ==>|"hit"| DONE

classDef cache stroke:#10b981,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef bad stroke:#dc2626,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Check memtable first; for each SSTable a tiny bloom filter answers 'definitely not' without a disk read — only a 'maybe' triggers an actual fetch.

Bloom filters are the secret. Each SSTable ships with a tiny probabilistic filter that can say “this key is definitely NOT here” in a single hash operation. A bloom filter miss skips the SSTable entirely; only a positive (or “maybe”) answer triggers an actual disk read. In practice this keeps point-lookup reads at 1–2 disk fetches even with hundreds of SSTables.

Real-world LSM users:

  • Cassandra — Netflix’s viewing-events storage.
  • RocksDB — the storage engine under Kafka Streams, Flink state, many Facebook services.
  • DynamoDB — publicly described as LSM-style internals.
  • ScyllaDB — a Cassandra-compatible rewrite in C++.

When to pick LSM in an interview: write-heavy workloads where reads are rare or latency-tolerant — metrics, logs, event streams, IoT telemetry. The moment a read-heavy user-facing query enters the picture, B-trees usually win on tail latency.

5. Hash index — O(1) exact match, nothing else#

A hash index is a persistent hashmap: hash(key) → bucket → row pointer. Exact-match lookups are constant-time; range queries, sorting, and prefix matching are impossible by design, because similar keys land in different buckets on purpose.

hash("alice@example.com") → bucket 42 → page 7, row 3
hash("bob@example.com") → bucket 17 → page 12, row 0

Why you rarely see them:

  • Postgres supports them (USING hash), doesn’t recommend them — B-trees handle equality almost as fast and give you range queries for free.
  • Only production-common place: in-memory stores. Redis’s top-level key→value map is a hash table. MySQL’s MEMORY engine defaults to hash indexes.
  • “Hash indexes solve a problem we rarely have” — Bruce Momjian, PostgreSQL core dev.

Interview rule: mention them as an option, but don’t linger. Choosing a hash index usually signals you haven’t considered what else the query might need six months from now.

6. Geospatial — when 2D kills your B-tree#

Proximity search — “restaurants within 5 km of me” — breaks B-trees. Two 1D indexes on latitude and longitude produce a rectangular filter region that’s much larger than the actual circular search. At scale (Uber, Yelp, Tinder) this is a 10,000× efficiency difference.

Three production approaches; you only need to defend one in an interview:

Geohash — turn 2D into a sortable string#

Recursively subdivide the globe into base-32 cells and encode each level as one character:

  • 9q — Western US
  • 9q8 — Bay Area
  • 9q8y — San Francisco
  • 9q8yy — Mission District
  • 9q8yyk — one city block
Geohash — 2D proximity as a string prefix
flowchart LR
    GPS["<img src="/icons/mdi-map-marker.svg" alt="mdi:map-marker" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> GPS point<br/><span class='sub'>37.77, -122.41</span>"]:::step
    ENC["<img src="/icons/mdi-cog.svg" alt="mdi:cog" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Geohash encoder<br/><span class='sub'>recursive subdivide</span>"]:::step
    STR["<img src="/icons/mdi-format-text.svg" alt="mdi:format-text" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> String<br/><span class='sub'>9q8yyk</span>"]:::step
    BT["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree index<br/><span class='sub'>prefix range scan</span>"]:::database
    NEAR["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Nearby points<br/><span class='sub'>prefix 9q8yy*</span>"]:::ok

    GPS --> ENC --> STR --> BT ==> NEAR

classDef database stroke:#f97316,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Encode lat/lon into a base-32 string; nearby points share long prefixes, so a plain B-tree prefix range scan serves proximity queries.

The elegance: close-in-reality locations share long common prefixes. A B-tree range scan on geohash LIKE '9q8yy%' returns all points in that cell. You get proximity search using nothing but a string column and a B-tree.

Redis’s geospatial commands (GEOADD, GEOSEARCH) use geohash internally. This is also the easiest to explain in an interview.

Quadtree — recursive 2D partitioning#

Divide space into 4 quadrants; when a quadrant has more than a threshold of points, split it again. Adaptive — dense areas get deep subdivision, empty ocean stays shallow. Historically important; most modern databases don’t ship quadtrees but the idea shows up in map-tile systems (Google Maps tiles are a quadtree).

R-tree — production standard#

Groups nearby objects into overlapping bounding rectangles that adapt to the data shape. Handles points and polygons in the same index (useful for “find delivery zones that cover this address”). PostGIS, MySQL’s SPATIAL INDEX, and SQLite R*Tree all use R-trees.

How to answer this in an interview: “A B-tree can’t serve proximity queries because lat/lon are independent dimensions. I’d reach for geohash if I can stay on Postgres without extensions — it turns 2D into string prefix matching so a plain B-tree works. For production GIS workloads I’d use PostGIS’s R-tree, which handles both points and polygons and is what Uber, Yelp, and most delivery apps actually run on.” That one sentence covers the whole topic.

A B-tree can find an exact email, a range of timestamps, and a prefix. It cannot find the word "database" anywhere inside a TEXT column. WHERE content LIKE '%database%' is an unindexable full scan.

An inverted index flips the relationship:

Documents:
doc1: "B-trees are fast and reliable"
doc2: "Hash tables are fast but limited"
doc3: "B-trees handle range queries well"
Inverted index:
"b-trees" → [doc1, doc3]
"fast" → [doc1, doc2]
"hash" → [doc2]
"range" → [doc3]

Production inverted indexes (Elasticsearch, Lucene, Postgres tsvector) run each document through an analysis pipeline before storing:

Inverted index — analysis pipeline
flowchart LR
    DOC["<img src="/icons/mdi-file-document-outline.svg" alt="mdi:file-document-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Raw text<br/><span class='sub'>The Databases Are Fast</span>"]:::step
    TOK["<img src="/icons/mdi-scissors-cutting.svg" alt="mdi:scissors-cutting" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Tokenize<br/><span class='sub'>[The, Databases, …]</span>"]:::step
    LOWER["<img src="/icons/mdi-format-letter-case-lower.svg" alt="mdi:format-letter-case-lower" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Lowercase<br/><span class='sub'>[the, databases, …]</span>"]:::step
    STOP["<img src="/icons/mdi-filter-remove-outline.svg" alt="mdi:filter-remove-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Drop stop words<br/><span class='sub'>[databases, fast]</span>"]:::step
    STEM["<img src="/icons/mdi-content-cut.svg" alt="mdi:content-cut" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Stem<br/><span class='sub'>[databas, fast]</span>"]:::step
    MAP["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Inverted map<br/><span class='sub'>databas → [doc1]</span>"]:::ok

    DOC --> TOK --> LOWER --> STOP --> STEM ==> MAP

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Tokenize, lowercase, drop stop words, stem — 'Databases' and 'database's' collapse to the stem 'databas' and share one postings list.

This is why a search for "Databases" matches database, databases, and database's — the analyzer collapses all of them to the stem databas. Elasticsearch layers relevance scoring (BM25 / TF-IDF), fuzzy matching, and phrase queries on top.

Production cost: inverted indexes are big (sometimes larger than the source corpus) and expensive to update — one document changing means touching every term it contains. That’s why Elasticsearch refresh intervals default to 1 second, not instant.

Where you’ll meet them in production: GitHub code search, Slack message history, any “search bar” on a content site, log search (OpenSearch / ELK), product catalogs. Even Postgres ships a usable inverted index via CREATE INDEX ... USING GIN(to_tsvector(...)).

8. Optimization patterns — composite + covering#

Once you know which type of index to use, there are two optimization patterns interviewers love to probe.

Composite index — one B-tree, multiple columns#

A classic feed query:

SELECT * FROM posts
WHERE user_id = 123
AND created_at > '2026-01-01'
ORDER BY created_at DESC;

Two single-column indexes force the planner to fetch two big sets and intersect them, then sort. A single composite index gives both filtering and sorting in one B-tree walk:

CREATE INDEX idx_user_time ON posts(user_id, created_at);

The index’s keys are tuples — sorted by user_id first, then created_at within each user:

Composite B-tree keyed by (user_id, created_at)
flowchart TD
    ROOT["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Composite B-tree<br/><span class='sub'>(user_id, created_at)</span>"]:::database
    K1["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-01)"]:::step
    K2["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-02)"]:::step
    K3["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (1, 2026-01-03)"]:::step
    K4["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (2, 2026-01-01)"]:::step
    K5["<img src="/icons/mdi-key-outline.svg" alt="mdi:key-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> (3, 2026-01-01)"]:::step

    ROOT --> K1
    ROOT --> K2
    ROOT --> K3
    ROOT --> K4
    ROOT --> K5

    K3 ==>|"user_id=1 AND created_at > '2026-01-01'<br/>range-scan THIS subtree only"| HIT["<img src="/icons/mdi-check-circle-outline.svg" alt="mdi:check-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> match"]:::highlight

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef database stroke:#f97316,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
Keys are tuples sorted by user_id first, then created_at within each user — WHERE user_id = ? AND created_at > ? range-scans exactly the right subtree.

Column order rules of thumb:

  • Equality first, range second, sort last. (user_id, created_at) works; (created_at, user_id) doesn’t help the common query pattern.
  • The index is only useful for prefixes of the column list. (a, b, c) accelerates WHERE a = ?, WHERE a = ? AND b = ?, WHERE a = ? AND b = ? AND c = ?. It does not accelerate WHERE b = ?.
  • “Most selective column first” is a good default, but query patterns trump selectivity — if you always sort by created_at, it must be in the index even if it’s not selective.

Covering index — answer without touching the table#

Normal index lookup = (1) find the row in the index, (2) fetch the actual row from the heap for any column the index doesn’t carry. For read-heavy queries returning just a few columns, those heap fetches are wasted work.

A covering index includes the extra columns directly:

-- Normal: index on (user_id, created_at); "likes" requires a heap read
CREATE INDEX idx_user_time ON posts(user_id, created_at);
-- Covering: "likes" travels with the index; query is answered purely from the index
CREATE INDEX idx_user_time_likes ON posts(user_id, created_at) INCLUDE (likes);

Trade-off: the index is bigger (more storage, slower writes) but the query is an order of magnitude faster because it skips heap page reads. Reach for this when one hot query dominates your table and returns few columns — activity feeds, leaderboards, unread-count endpoints. Don’t sprinkle INCLUDE everywhere; modern query planners are good with plain indexes and the maintenance cost is real.

Write-side optimizations — shrink what you maintain#

Every index costs write throughput. The goal here is to maintain less index per write without losing read speed.

Partial indexes — index only the rows a hot query actually touches. Classic case: an orders table where 99% of rows are status = 'closed' but every query filters on status = 'open':

-- Instead of a full index on status (bloated, rarely useful):
CREATE INDEX idx_orders_status ON orders(status);
-- Index only the 1% that matters — drop size + write cost by ~100x:
CREATE INDEX idx_open_orders ON orders(user_id, created_at)
WHERE status = 'open';

Inserts that land on closed orders skip this index entirely. The index stays tiny (1% of rows) and SELECT … WHERE status = 'open' AND user_id = ? uses it.

Fill factor + HOT updates (Postgres) — Postgres’s Heap-Only Tuple optimization lets an UPDATE re-write the row in the same page without touching any index, as long as no indexed column changes and the page has free space. Dropping fill factor leaves that free space:

ALTER TABLE posts SET (fillfactor = 85); -- 15% free per page

Biggest wins on hot-updated tables (user sessions, counters) where the updated columns are NOT indexed.

CREATE INDEX CONCURRENTLY — a regular CREATE INDEX takes an exclusive write lock. On a production table with 100 M rows, that’s minutes of downtime. The concurrent variant builds in the background:

CREATE INDEX CONCURRENTLY idx_posts_user_time ON posts(user_id, created_at);

Slower overall, but no write lock. Non-negotiable for production schema changes.

Drop + recreate for bulk loads — if you’re importing 100 M rows from a dump, every existing index pays the write tax row-by-row. Drop the indexes, COPY the data, recreate the indexes once at the end. Typically 3–10× faster.

UNLOGGED / async commit for non-critical tables — staging tables, session stores, and rate-limit counters don’t need WAL durability. UNLOGGED tables skip the WAL and synchronous_commit = off lets the commit return before the WAL fsync. Accept the “lose the last second on crash” trade-off.

Read-side optimizations — answer more queries from less I/O#

Functional (expression) indexes — if you query a computed value, index the computation, not the raw column:

-- WHERE lower(email) = ? — a regular index on email is NOT used
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- WHERE (data->>'status') = 'active' — JSONB path index
CREATE INDEX idx_orders_status_jsonb ON orders((data->>'status'));

A common interview gotcha: candidates add an index on created_at but the query is WHERE DATE(created_at) = CURRENT_DATE. The regular index is useless; you need INDEX ON (DATE(created_at)).

Bitmap index scans (Postgres) — when a query filters on several columns that each have their own single-column index, Postgres can scan each index, build a bitmap of matching pages, AND/OR the bitmaps, then fetch only pages in the result set. Lets you avoid a composite index when queries mix columns unpredictably:

-- Rather than one composite for every combination, keep single-column indexes:
CREATE INDEX idx_status ON orders(status);
CREATE INDEX idx_customer ON orders(customer_id);
-- Postgres bitmap-AND-merges these automatically for
-- WHERE status = 'open' AND customer_id = 42

Don’t rely on this for the hottest query — a targeted composite index is still faster — but it’s the right tool when query shapes are ad-hoc and many.

Index-only scans + visibility map — a covering index is “only as covering as the visibility map allows.” Postgres has to check whether a row is visible to the current transaction; for ancient stable rows the visibility map says “all visible, trust the index.” For recently-updated rows it falls back to a heap fetch. Running VACUUM (or autovacuum tuned aggressively) updates the visibility map and keeps index-only scans actually … index-only.

Materialized views for expensive aggregations — if a dashboard recomputes SUM(amount) GROUP BY region, day from 10 B rows on every render, no index helps. Precompute:

CREATE MATERIALIZED VIEW sales_daily AS
SELECT region, date_trunc('day', created_at) AS day, SUM(amount) AS total
FROM orders GROUP BY 1, 2;
CREATE UNIQUE INDEX ON sales_daily(region, day);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily; -- nightly cron

You’ve traded freshness for read latency — the right move for reporting and analytics endpoints.

Balancing the trade-off#

A realistic rule of thumb for the maintenance cost of an index:

write cost    Nindexes×Wwidth×Rwrite-rate\text{write cost} \;\propto\; N_\text{indexes} \times W_\text{width} \times R_\text{write-rate}

Every index pays on every write. That’s why staff-level answers always name the write side when proposing a new index.

Which index optimization to reach for
flowchart TD
    Q["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> New query arrives"]:::ask
    RATE{{"Read rate ≫ write rate?"}}:::ask
    HOT{{"Always filters same subset?"}}:::ask
    SORT{{"Sorts / orders by a column?"}}:::ask
    AGG{{"Aggregates over many rows?"}}:::ask

    COMP["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Composite index<br/><span class='sub'>equality → range → sort</span>"]:::ok
    PART["<img src="/icons/mdi-filter-outline.svg" alt="mdi:filter-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Partial index<br/><span class='sub'>WHERE predicate</span>"]:::ok
    COVER["<img src="/icons/mdi-table-column.svg" alt="mdi:table-column" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Covering index<br/><span class='sub'>INCLUDE columns</span>"]:::ok
    MV["<img src="/icons/mdi-table-eye.svg" alt="mdi:table-eye" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Materialized view<br/><span class='sub'>+ index the view</span>"]:::ok
    NONE["<img src="/icons/mdi-close-circle-outline.svg" alt="mdi:close-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Leave it; seq-scan<br/><span class='sub'>writes too hot</span>"]:::warn

    Q --> RATE
    RATE -->|"yes"| HOT
    RATE -->|"no — writes dominate"| NONE

    HOT -->|"yes"| PART
    HOT -->|"no"| SORT

    SORT -->|"yes"| COMP
    SORT -->|"no"| AGG

    AGG -->|"yes"| MV
    AGG -->|"no — few columns"| COVER

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Walk down: writes dominate → no index; hot filter subset → partial; sort-ordered → composite; aggregate-heavy → materialized view; covering last.

Finding indexes you don’t actually need — Postgres keeps counters for every index:

-- Indexes that have never been used since the last reset
SELECT schemaname, relname AS table, indexrelname AS index, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Queries that do full table scans worth fixing
SELECT query, calls, total_exec_time
FROM pg_stat_statements
WHERE query !~* 'pg_'
ORDER BY total_exec_time DESC
LIMIT 20;

Production heuristic: drop any index with idx_scan = 0 after 30 days of traffic (except the ones only used by rare cron jobs — tag those). Every dropped index is free write throughput.

9. Cheat sheet — which index for what#

Which index type for which query
flowchart TD
    Q["<img src="/icons/mdi-help-circle-outline.svg" alt="mdi:help-circle-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> What is the query?"]:::ask

    EQ{{"Exact match on a column?"}}:::ask
    RANGE{{"Range / sort on a column?"}}:::ask
    GEO{{"Proximity on (lat, lon)?"}}:::ask
    TEXT{{"Substring inside TEXT?"}}:::ask
    WRITE{{"Writes 10× reads?"}}:::ask

    BTREE["<img src="/icons/mdi-file-tree.svg" alt="mdi:file-tree" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> B-tree<br/><span class='sub'>your default</span>"]:::highlight
    LSM["<img src="/icons/mdi-database-arrow-down.svg" alt="mdi:database-arrow-down" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> LSM tree<br/><span class='sub'>time-series, logs</span>"]:::ok
    HASH["<img src="/icons/mdi-pound.svg" alt="mdi:pound" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Hash<br/><span class='sub'>in-memory, no range</span>"]:::warn
    GEOHASH["<img src="/icons/mdi-map-marker.svg" alt="mdi:map-marker" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Geohash<br/><span class='sub'>string + B-tree</span>"]:::ok
    RTREE["<img src="/icons/mdi-vector-square.svg" alt="mdi:vector-square" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> R-tree / PostGIS<br/><span class='sub'>points + polygons</span>"]:::ok
    INV["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Inverted index<br/><span class='sub'>tsvector, search</span>"]:::ok

    Q --> EQ
    Q --> RANGE
    Q --> GEO
    Q --> TEXT
    Q --> WRITE

    EQ  ==>|"yes"| BTREE
    EQ  -->|"in-RAM, no sort"| HASH
    RANGE ==> BTREE
    GEO -->|"Postgres / Redis"| GEOHASH
    GEO -->|"production GIS"| RTREE
    TEXT --> INV
    WRITE --> LSM

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
B-tree for almost everything; LSM when writes dominate; hash only in-memory; geohash or R-tree for 2D; inverted for full text.

One-sentence summary per type#

TypePick whenDon’t pick when
B-treeYou need equality + range + sort on diskNever wrong; default choice
LSM treeWrites ≫ reads, time-series, append-heavyUser-facing latency-critical reads
HashIn-memory, exact match, no range everAnywhere you might later need sort / range
GeohashProximity search on plain Postgres / RedisYou need polygon queries
R-tree / PostGISProduction GIS — points + shapesYou only have points and no extension
InvertedSubstring or relevance search in TEXTSmall text columns where LIKE '%x%' is cheap

Pitfalls to volunteer#

  • Over-indexing. Each index slows every write on the table; audit unused ones via pg_stat_user_indexes in Postgres.
  • Leftmost-prefix rule. (a, b, c) index does not help WHERE b = ?. A composite index is an ordered commitment.
  • Indexes on low-cardinality columns are mostly useless. A B-tree on status with three values (pending, active, done) rarely beats a seq scan.
  • Functional indexes. If you query LOWER(email) = ?, you need CREATE INDEX ON users (LOWER(email)) — a regular index on email won’t be used.
  • NULL semantics. Most indexes do store NULLs (Postgres, MySQL B-tree) but some operators won’t use them; WHERE col IS NOT NULL usually bypasses the index unless you have a partial index.
  • Partial indexes. CREATE INDEX ON orders (user_id) WHERE status = 'open' keeps the index small when the hot query only looks at a sliver of the table — Postgres specialty.
  • Writes during read queries. Long-running analytical reads on an LSM-heavy table can pin old SSTables on disk; compaction can’t reclaim space. Watch tombstone accumulation in Cassandra.
  • Index-only scans. Postgres only uses them if the visibility map says the page is all-visible. A covering index that looks covering still hits the heap on recently-updated rows. Vacuum matters.

Indexes are a single abstraction with many implementations, each tuned to a different workload shape. You’ll be fine in 95% of interviews by answering B-tree and defending it; the other 5% is recognizing the shape of the problem — heavy writes (LSM), proximity (geohash/R-tree), substring (inverted) — and reaching for the right tool. Everything else is optimization.

Database Indexing
https://blog.viethx.com/posts/database-indexing/
Author
Thomas Engineer
Published at
2026-04-22
License
CC BY-NC-SA 4.0
More in Core Concepts