Scaling Reads - Thomas Engineer

Most systems are read-heavy. A social feed, a commerce catalog, a dashboard — the read ratio lives between 10:1 and 10,000:1. The harder scaling questions in interviews are almost always about reads, because one write fans out into many derived reads that all need to be fast.

This post is the toolkit. Each section is a lever you can pull, with the trade-off that lever costs you and the production system that made it famous.

How to use this: the order is intentional — cheapest levers first. Pull them in order until the numbers stop hurting. Jumping straight to “shard everything + Kafka” without checking if a connection pool fix would solve it is the most common interview mistake.

1. Start with the math, not the architecture#

Before any architecture talk, write the four numbers on the whiteboard:

1
QPS     — queries per second, p50 and p99 peak
2
latency — what the user tolerates (e.g., 200 ms p99)
3
payload — bytes per query × QPS = bandwidth
4
ratio   — reads / writes (often 100:1 for feeds, 10,000:1 for catalogs)

One insight that reframes every interview: a single write at the top of a fan-out chain can drive hundreds of reads downstream. Twitter’s 2010 stack serviced 300k tweets/day but 2B timeline reads/day — a 7000:1 amplification, entirely because one tweet showed up in thousands of followers’ timelines.

Read amplification from one tweet

flowchart LR
    W["<img src="/icons/mdi-send.svg" alt="mdi:send" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> 1 tweet<br/><span class='sub'>WRITE</span>"]:::warn
    FOLLOW["<img src="/icons/mdi-account-multiple.svg" alt="mdi:account-multiple" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> 5,000 followers<br/><span class='sub'>fan-out</span>"]:::highlight
    R1["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 1<br/><span class='sub'>read</span>"]:::ok
    R2["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 2<br/><span class='sub'>read</span>"]:::ok
    RDOTS["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> …<br/><span class='sub'>read</span>"]:::ok
    R5K["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 5000<br/><span class='sub'>read</span>"]:::ok

    W ==>|"1 write"| FOLLOW
    FOLLOW --> R1
    FOLLOW --> R2
    FOLLOW --> RDOTS
    FOLLOW --> R5K

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;

One write at the top of a fan-out chain drives thousands of downstream reads — the 7000:1 ratio Twitter measured in 2010.

Interview move: always state the read-amplification factor out loud before picking an architecture. It disciplines your thinking (“do I actually need to shard, or is this a cache problem?”) and impresses interviewers.

2. Pull the boring levers first#

Before a single new box enters the architecture, three wins are hiding in plain sight:

Bigger primary. A Postgres box with 32 vCPU and 256 GB RAM handles tens of thousands of QPS if your working set fits in memory. Vertical scaling is mocked but it’s cheap — Stack Overflow famously runs on nine servers (2 SQL Servers, a handful of web, some Redis) serving 5 billion pageviews a year. The limit isn’t capacity, it’s the Universal Scalability Law (Gunther, 2007), which models how contention (α) and coherence cost (β) cause throughput to peak at a finite number of workers:

C(N) \;=\; \frac{N}{1 + \alpha(N - 1) + \beta N (N - 1)}

For most workloads, you hit the knee around 8–32 cores. Above that, the linear increase in contention eats the gains. That’s why “bigger box” is a real lever and has a ceiling.

Connection pooling — the Little’s Law corollary. A single Postgres connection is a dedicated OS process that reserves 5–10 MB of RAM plus a work_mem allocation per sort/hash and its share of shared buffers (see the pgBouncer FAQ and CitusData’s analysis). At 5,000 app pods each opening 10 connections, you’ve allocated 50,000 connections at the DB — more than the server can accept, and the RAM footprint alone could exceed your entire buffer cache. Put PgBouncer (or a JDBC pool, or AWS RDS Proxy) in front and cap concurrent active connections to 100–500.

The math is Little’s Law (Little, 1961): L = λ × W. If your arrival rate λ is 5,000 req/s and each query holds a connection for W = 20 ms, the average number L of busy connections is only 5000 × 0.02 = 100. You need 100 connections to serve this load, not 50,000. Every connection beyond the steady-state L wastes memory and file descriptors without helping throughput.

But — what if every connection in the pool is busy? This is the scenario real operators lose sleep over, because the failure mode cascades faster than most other outages. Picture a 100-connection pool at 100% utilization: query #101 has no connection to grab, and what happens next depends entirely on how your pool is configured.

Connection pool exhaustion cascade

sequenceDiagram
    autonumber
    participant C as Client
    participant LB as Load Balancer
    participant APP as App pod
    participant POOL as Connection pool
    participant DB as Postgres

    C->>LB: request 101
    LB->>APP: forward
    APP->>POOL: acquire connection
    Note over POOL: all 100 in use, block

    rect rgb(254,243,199)
    Note over POOL: waiting - max_wait 5s
    end

    rect rgb(254,226,226)
    POOL-->>APP: timeout - PoolExhaustedException
    APP-->>LB: HTTP 503 or 504 after 30s
    Note over APP: app thread still held by earlier queued requests
    end

    rect rgb(254,226,226)
    C->>LB: retry - and so do 1000 others
    Note over LB,APP: retry storm amplifies load, pool stays exhausted longer
    end

    rect rgb(220,252,231)
    Note over APP,DB: Mitigations - fail-fast pool timeout, circuit breaker, 503 Retry-After, drop requests at LB above 90 percent
    end

What happens at request 101 when the pool has 100 connections and the retry storm pins the DB until someone drains the queue.

The 5-stage cascade when you don’t defend against this:

Queue at the pool. HikariCP / PgBouncer holds new requests up to max_wait_time (default 30 s). During this window, app threads are pinned waiting for connections — memory and file descriptors stay allocated.
App worker starvation. Every queued request consumes a thread. If your app serves 1,000 req/s and each blocks 5 s waiting, you need 5,000 threads — most servers topple at a few hundred.
Load-balancer backup. The LB’s keep-alive queue to the app fills. TCP accept-queues overflow. New clients get connection refused before they even reach the app.
Client retry storm. Every failed request retries, often exponentially. A 10% error rate turns into 30% within seconds because retries add to the next wave of traffic.
The DB never recovers on its own — even if downstream latency fixes itself, the queued requests keep the pool saturated until either clients give up or someone drains the queue.

Defensive patterns that actually work:

Fail fast, not forever. Set pool_timeout = 500ms, not 30s. Return 503 early so the client can give up before its own timeout.
Retry-After + jitter. When returning 503, include Retry-After: 5 so well-behaved clients back off. Clients that ignore it (usually internal services) need their own retry budget with jitter.
Reject at the LB. Nginx limit_conn / HAProxy maxconn rejects new connections when the upstream is saturated, preventing the LB itself from queueing.
Circuit breaker on the caller side. When 50% of requests to a downstream fail with pool timeouts, open the breaker and stop calling for 30 s (see Networking Essentials §6).
Monitor pool.waitCount + pool.waitDuration. These are leading indicators. Alert at 90% utilization, not at exhaustion.

Real incident worth memorizing — Heroku, Oct 2021. During a planned Postgres follower failover, thousands of tenants’ PgBouncer instances hit connection pool exhaustion simultaneously. Clients retried, retries amplified demand, and the incident lasted ~90 minutes across the platform. Postmortem: fail-fast timeouts were set too generously (30s); the default let queuing consume app threads faster than the failover could complete. Heroku’s fix: tighten default timeouts + surface pool.waitCount in tenant dashboards. Status post for reference.

Make the slow queries fast. Pull pg_stat_statements, sort by total_exec_time, look at the top 10. Often 1–2 queries missing an index consume half the database CPU. Fixing those buys months of runway before any replica or cache is needed (see §3-4 of Database Indexing).

3. Read replicas — the first real scaling lever#

Once a single primary saturates, split reads and writes. Writes go to the primary; reads round-robin across N read replicas that stream the WAL from the primary.

Primary + read replicas

flowchart LR
    APP["<img src="/icons/mdi-server.svg" alt="mdi:server" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> application pods<br/><span class='sub'>stateless</span>"]:::client

    subgraph DB ["Database tier"]
        direction TB
        PRI["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Primary<br/><span class='sub'>writes + reads</span>"]:::highlight
        R1["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 1<br/><span class='sub'>read-only</span>"]:::database
        R2["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 2<br/><span class='sub'>read-only</span>"]:::database
        R3["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 3<br/><span class='sub'>read-only</span>"]:::database
        PRI -.->|"WAL stream (async)"| R1
        PRI -.->|"WAL stream"| R2
        PRI -.->|"WAL stream"| R3
    end

    APP ==>|"INSERT / UPDATE"| PRI
    APP -->|"SELECT"| R1
    APP -->|"SELECT"| R2
    APP -->|"SELECT"| R3

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:2px,color:#4c1d95;

    classDef clusterOrange fill:transparent,stroke:#f97316,stroke-dasharray:6 4
    class DB clusterOrange

classDef client stroke:#64748b,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;

Writes go to the primary; reads round-robin across N replicas that stream the WAL asynchronously. Adding replicas scales reads but not writes.

Trade-off — the CAP and PACELC framing. Replicas make explicit a trade-off that’s always there, just hidden on a single node: during any network delay you can have strong consistency or availability of every replica, not both (CAP theorem, Brewer 2000; formally proved by Gilbert & Lynch, 2002). Real systems extend this via PACELC (Abadi 2012): during a partition (P) pick A vs C; else (E) pick L (low latency) vs C (strong consistency). Postgres streaming replication is PA/EL by default — it sacrifices consistency during partitions and for latency when things are healthy, in exchange for availability and speed. Switching to synchronous_commit = remote_apply flips it to PC/EC — available only when a quorum of replicas is up.

Concretely — replication lag. A write commits on the primary; the WAL takes 10–500 ms to ship and apply. If your next read lands on a replica that hasn’t caught up, you see stale data — the classic “I just posted a comment and it disappeared” bug. This isn’t a Postgres quirk; it’s the A in PACELC manifesting on every read.

Production patterns for read-your-writes:

Sticky routing for a session window — after a user writes, route their next N seconds of reads back to the primary. GitHub and Shopify both do this.
Write-log cookie — the app gets the primary’s log position (LSN in Postgres) after a write; subsequent reads carry the LSN and the query router waits for a replica to catch up or falls back to the primary.
Synchronous replicas for critical paths — synchronous_commit = remote_apply makes the primary block until at least one replica has applied the WAL. Durability win, p99 cost. Financial systems do this; social networks don’t.

Vitess (at YouTube, Slack, GitHub) is the industry-standard router for this pattern: it speaks the MySQL wire protocol, routes queries by rule, and handles connection pooling + read-your-writes in one box.

LinkedIn’s Espresso takes this even further. Their feed and profile services run on a custom NoSQL store on top of MySQL where the read path is split into “read-from-local-region” (eventually consistent) and “read-from-master” (strongly consistent). Applications opt into the consistency level per-endpoint: a profile view is eventually consistent; confirming a just-sent connection request reads from master. Same pattern, just exposed as a first-class API.

When replicas stop helping: write throughput now caps you. Replicas all replay the same WAL, so adding more replicas speeds reads but not writes. At that point you’re sharding (see §8 and my forthcoming sharding post).

4. Caching — the other 10× lever#

A cache hit is an index lookup in RAM. A DB query is (best case) an index lookup that traverses to a heap page on SSD. The speed ratio is ~100:1 — and the reason every latency number matters is “The Tail at Scale” (Dean & Barroso, 2013). Their central result: when a single page requires 100 parallel backend calls, a 99th-percentile backend latency of 100 ms becomes the 63rd percentile page latency. Tail latency, not mean, dominates user-visible performance. Caches attack this by eliminating the bottom-percentile DB round-trips almost entirely.

Latency numbers every engineer should know#

Adapted from Jeff Dean’s canonical table. Internalizing these is how you reason about cache design:

Operation	Time	Analogy (1 ns = 1 s)
L1 cache reference	0.5 ns	0.5 s
L2 cache reference	7 ns	7 s
Main memory reference	100 ns	2 min
Compress 1 KB (snappy)	3 µs	1 hour
Send 1 KB over 1 Gbps	10 µs	3 hours
SSD random read	150 µs	2 days
Read 1 MB sequentially from memory	250 µs	3 days
Round trip within a datacenter	500 µs	6 days
Read 1 MB sequentially from SSD	1 ms	12 days
Disk seek (rotational)	10 ms	4 months
Read 1 MB sequentially from disk	20 ms	8 months
Cross-continent round trip	150 ms	5 years

The takeaway: in-memory cache is ~1,500× faster than SSD random reads; same-DC RTT is roughly one SSD seek. When you trade a network hop for avoiding a DB query, the math almost always wins.

Real systems layer caches; each layer catches what the layer below missed:

Cache layers — each catches what the layer below missed

flowchart TD
    USER["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> User"]:::client
    CDN["① CDN / Edge<br/><span class='sub'>ex. Cloudflare</span>"]:::edge
    EDGE["② Regional<br/><span class='sub'>ex. EVCache</span>"]:::cache
    APPC["③ Application<br/><span class='sub'>ex. Redis</span>"]:::cache
    LOCAL["④ In-process<br/><span class='sub'>ex. Caffeine</span>"]:::cache
    DB["⑤ Database<br/><span class='sub'>source of truth</span>"]:::database

    USER -->|"90-99% hit"| CDN
    CDN -->|"on miss"| EDGE
    EDGE -->|"on miss"| APPC
    APPC -->|"on miss"| LOCAL
    LOCAL -->|"on miss"| DB
    DB -.->|"backfill on read"| APPC

classDef client stroke:#64748b,stroke-width:1.75px;
classDef edge stroke:#3b82f6,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;
classDef cache stroke:#10b981,stroke-width:1.75px;

CDN absorbs 90-99% of static asset traffic; regional and application caches handle keyed reads; the DB sees only the long-tail miss.

Case studies per layer#

Facebook TAO (③ application cache) — Facebook’s entire social graph reads go through TAO, a memcached-based read-through cache in front of MySQL. At peak, TAO serves billions of reads per second with a 99%+ hit ratio; only the 1% miss ever touches MySQL. The social graph would be physically impossible to serve without this layer.

Netflix EVCache (② regional) — Memcached clusters replicated across AZs. The Netflix homepage for a single user aggregates results from 20+ microservices; EVCache holds each microservice’s result per-user so the homepage builds in <100 ms even when individual services are slow. At peak it serves tens of millions of ops/sec per region.

Cloudflare / Fastly (① CDN) — Stack Overflow caches their rendered HTML at the edge with a 1-minute TTL + stale-while-revalidate. The result: the primary origin sees ~5% of total traffic. Every question page and tag feed is served from the CDN unless it’s been edited.

In-process LRU (④) — commonly forgotten. A 100 MB Caffeine cache inside every pod catches the top ~1000 hot keys with zero network hop. Used heavily at Netflix (Hystrix) and Google (Guava Cache). Trade-off: consistency across pods is loose — each pod has its own copy; invalidation is harder.

Wikipedia’s Varnish + Memcached stack is the textbook CDN-plus-object-cache architecture. Every article page first tries Varnish (edge HTTP cache, 5-minute TTL) for unauthenticated readers — that layer alone serves over 90% of Wikipedia’s page views. Misses hit MediaWiki, which checks Memcached for parsed page fragments before rendering from MariaDB. The database sees only a trickle of actual reads; for a site hosting tens of billions of monthly views, this is how the operation stays affordable on donations. (Wikimedia infrastructure docs.)

Shopify Black Friday stresses every layer simultaneously. Their team pre-warms the CDN edges and Redis hot-key caches for the top 10k merchants the week before peak, simulates traffic in “Shopify Resiliency Engineering” game days, and monitors cache hit ratios as a first-class SLO. During the 2023 BFCM weekend they served 75.5 million orders and $9.3 B in sales without a primary-database scale event — almost entirely because the read tier absorbed the spike.

When caches go wrong#

A cache is a distributed copy of your database that will, eventually, disagree with it. The five failure modes the interviewer wants to hear:

Stale reads — TTL expired but data changed mid-window. Accept for feeds; fatal for balances.
Cache stampede (thundering herd) — a popular key expires; 10,000 clients all miss and hit the DB simultaneously. Fix: request coalescing (singleflight), stale-while-revalidate, or add a small random jitter to TTLs.
Hot-key overload — one celebrity account’s profile is 5% of global traffic; the single Redis shard holding it melts. Fix: client-side shadow replicas of hot keys, or a local LRU layer in front.
Cache penetration — a bad actor queries keys that don’t exist; every miss hits the DB. Fix: cache negative responses with a shorter TTL, or a bloom filter at the cache layer.
Cache avalanche — power cycle + all TTLs expire simultaneously. Fix: warm the cache before serving traffic, staggered TTLs.

5. Cache invalidation — pick your poison#

The hard part isn’t caching; it’s knowing when the cached value is wrong. Four canonical strategies, each with a production champion:

Four cache invalidation strategies

flowchart TD
    Q["<img src="/icons/mdi-help-circle.svg" alt="mdi:help-circle" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> New cached value<br/><span class='sub'>which strategy?</span>"]:::ask

    TTL["<img src="/icons/mdi-timer-outline.svg" alt="mdi:timer-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> TTL<br/><span class='sub'>value + expiry</span>"]:::step
    WT["<img src="/icons/mdi-database-arrow-right.svg" alt="mdi:database-arrow-right" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Write-through<br/><span class='sub'>cache + DB, one txn</span>"]:::step
    WB["<img src="/icons/mdi-database-clock.svg" alt="mdi:database-clock" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Write-behind<br/><span class='sub'>DB catches up async</span>"]:::step
    INV["<img src="/icons/mdi-bell-ring.svg" alt="mdi:bell-ring" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Event-driven<br/><span class='sub'>subscribers invalidate</span>"]:::step

    Q --> TTL
    Q --> WT
    Q --> WB
    Q --> INV

    TTL -.->|"simple, slightly stale"| OUT1["<img src="/icons/mdi-view-dashboard.svg" alt="mdi:view-dashboard" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Dashboards<br/><span class='sub'>read-heavy</span>"]:::ok
    WT  -.->|"strict, slower writes"| OUT2["<img src="/icons/mdi-cart.svg" alt="mdi:cart" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Pinterest, Shopify<br/><span class='sub'>board / cart</span>"]:::warn
    WB  -.->|"fast writes, crash loss"| OUT3["<img src="/icons/mdi-chart-line.svg" alt="mdi:chart-line" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Metrics<br/><span class='sub'>analytics</span>"]:::warn
    INV -.->|"strong + scalable, complex"| OUT4["<img src="/icons/mdi-car.svg" alt="mdi:car" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Uber, Facebook TAO<br/><span class='sub'>trip / graph</span>"]:::ok

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;

Each strategy pays in a different coin — latency, staleness, durability, or infra complexity. Pick by what the business can tolerate.

TTL — simplest. SET key "..." EX 60. Works for 80% of cases. Accept 60 seconds of staleness; gain infinite simplicity.

Write-through — write hits both cache and DB in one transaction. Pinterest uses this for board state: you pin a photo, the cached board view updates before the HTTP response returns. Cost: every write is now 2× slower.

Write-behind (write-back) — write to cache first, return 200, sync to DB asynchronously. Metrics collectors (Datadog, Prometheus) do this for throughput; accept losing the last second of data on crash.

Event-driven invalidation — the DB (or its CDC stream) emits a changed(key=X) event; cache subscribers purge that key. Uber invalidates trip-state caches this way: a driver status update goes to Kafka, cache invalidators consume it and purge the relevant keys within ~100 ms. Facebook TAO uses a similar pattern for social graph edges. Most scalable and most complex — you need an event bus, consumers, and careful ordering.

6. Fan-out — the classic timeline problem#

When one action (a post, a like, a purchase) must appear in many users’ personalized views, you face the timeline problem. Two architectures:

Fan-out: push vs pull

flowchart LR
    subgraph FOW ["① Fan-out on write (push)"]
        direction TB
        P1["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::warn
        W1["<img src="/icons/mdi-cog.svg" alt="mdi:cog" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Writer<br/><span class='sub'>fan-out</span>"]:::compute
        INBOX1["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob's timeline"]:::storage
        INBOX2["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Carol's timeline"]:::storage
        INBOXN["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> … 5M timelines"]:::storage
        P1 --> W1
        W1 --> INBOX1
        W1 --> INBOX2
        W1 --> INBOXN
    end

    subgraph FOR ["② Fan-out on read (pull)"]
        direction TB
        P2["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::ok
        DB["<img src="/icons/mdi-database-outline.svg" alt="mdi:database-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> posts table<br/><span class='sub'>one row</span>"]:::database
        READ["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob opens app<br/><span class='sub'>merge + sort</span>"]:::ok
        P2 --> DB
        READ --> DB
    end

    classDef clusterRed fill:transparent,stroke:#dc2626,stroke-dasharray:6 4
    classDef clusterGreen fill:transparent,stroke:#16a34a,stroke-dasharray:6 4
    class FOW clusterRed
    class FOR clusterGreen

classDef compute stroke:#7c3aed,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;

Push pre-computes every follower's timeline on write — O(followers) per post but O(1) reads. Pull gathers followees' posts at read — cheap writes, scatter-gather reads.

	Fan-out on write	Fan-out on read
Write cost	O(followers) — massive	O(1)
Read cost	O(1) — timeline is precomputed	O(people you follow) — scatter-gather
Storage	N × followers × posts	N × posts
Wins when	Most users have few followers, reads far exceed writes	Users follow few accounts; write amplification infeasible

Twitter (2013-ish) — famously hit a wall on pure fan-out on write. A tweet from Taylor Swift (100M followers) wrote 100M inbox entries. Their answer:

The hybrid — fan-out on write + fan-out on read for celebrities.

Hybrid fan-out — push for normals, pull for celebrities

flowchart TD
    POST["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::step
    CHECK{{"Celebrity?<br/><span class='sub'>>1M followers</span>"}}:::ask

    WRITE["<img src="/icons/mdi-account-multiple.svg" alt="mdi:account-multiple" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Fan-out on write<br/><span class='sub'>push to inboxes</span>"]:::ok
    NOWRITE["<img src="/icons/mdi-close-octagon.svg" alt="mdi:close-octagon" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Skip fan-out<br/><span class='sub'>too expensive</span>"]:::warn

    READ["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob opens app"]:::step
    INBOX["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Precomputed timeline<br/><span class='sub'>regular posts</span>"]:::ok
    MERGE["<img src="/icons/mdi-merge.svg" alt="mdi:merge" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Merge celebrity posts<br/><span class='sub'>fetched on read</span>"]:::highlight
    TIMELINE["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Final timeline"]:::step

    POST --> CHECK
    CHECK -->|"no (regular)"| WRITE
    CHECK -->|"yes (celebrity)"| NOWRITE

    READ --> INBOX
    READ --> MERGE
    INBOX --> TIMELINE
    MERGE --> TIMELINE

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;

Cap fan-out-on-write at ~1M followers; anyone over that writes once and readers pull at render time. Keeps the amplification bounded.

Real-world hybrid examples:

Instagram — identical strategy; the “Explore” feed is also fan-out on read over ranked candidate posts.
LinkedIn feed — fan-out on write to a per-user “follow feed” using Kafka + Samza stream processing. When a user posts, Samza fans the event out to follower inboxes within seconds. Ranking and personalization happen at read time on top of the fanned-out candidate set. Scale: billions of feed updates per day across hundreds of millions of members.
Discord channel messages — fan-out on read (one channel, many subscribers). Scales because channels are small (usually <1k active members) and Cassandra/ScyllaDB handles the per-channel partition cheaply.
Slack — fan-out on write for mentions (push notifications + badge counts), fan-out on read for channel history. Hybrid chosen per message type; mentions rarely exceed a few dozen recipients, channel scrollback can be thousands.
Tinder’s Swipe API — a specialized variant. Each user’s “deck” is precomputed (fan-out on write when new candidates match filters), stored in Redis as a sorted set. Swiping pops the head of the list and pushes telemetry into Kafka for re-ranking. This is fan-out-on-write applied to a personalized queue rather than a social feed.

The interview answer: “I’d default to fan-out on write for the normal case and switch to fan-out on read for high-fanout outliers. I’d cap the fan-out write to ~10k followers and mark accounts above that as ‘pull’ so the math stays bounded.”

7. Denormalization, materialized views, and approximation#

When you can’t shrink the read load, move the work to write time.

Precomputed read models#

Reddit’s hot feed — Reddit maintains sorted sets in Redis keyed by subreddit, with score = hotness function (log10(upvotes) + age_factor). Every upvote updates the sorted set score. Rendering the hot page is ZREVRANGE subreddit:pics 0 25 — a single 1 ms Redis call, not a DB aggregation.

YouTube’s watch history — every play writes to a per-user materialized list (“recently watched”), kept in Bigtable with a TTL. The alternative, aggregating play events from an event stream at read time, would take seconds per user.

Approximate counters — trade accuracy for memory#

Exact counts on popular objects are an anti-pattern — every view or like becomes a DB write. The academic answer is a family of probabilistic data structures that compress unbounded sets into fixed-size sketches with provable error bounds.

HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) — counts unique elements from a stream with standard error 1.04 / √m where m is the register count. The default m = 2^14 buckets (16 KB) gives ≈0.8% error on cardinalities up to 10⁹. YouTube’s billion-view counters, Redis’s PFCOUNT, and Google BigQuery’s APPROX_COUNT_DISTINCT are all HLL.
Bloom filters (Bloom, 1970) — test set membership with zero false negatives and bounded false positives. Given n items, m bits, and k hash functions, false-positive rate is (1 − e^(−kn/m))^k, minimized when k ≈ (m/n) ln 2. Cassandra uses a bloom filter per SSTable to skip disk reads; Medium uses them to detect articles you’ve already read.
Count-min sketch (Cormode & Muthukrishnan, 2005) — approximate frequency counts in sub-linear space; what Redis uses internally for its CMSKETCH commands and what ad-tech uses for fraud counters.

Practical examples:

YouTube view count — for videos under ~1000 views, exact. Above that, HyperLogLog estimates ±2% from a tiny (<1 KB) sketch. The homepage shows “1.2M views” — nobody cares that it’s actually 1,199,834.
Twitter like counts — buffered in Redis with periodic flush; exact within the last few seconds at most.
Instagram story views — recent activity is exact, historical is summarized.

The general principle: a deterministic query over billions of events is linear work. A sketch is constant-space, sub-linear update — you pay O(1) per insert and O(1) per query, accepting bounded error. For read-heavy analytics where “close enough” is fine, this is how you escape the write-amplification trap entirely.

1
-- Instead of O(plays) sums every request, precompute:
2
CREATE MATERIALIZED VIEW video_play_counts AS
3
SELECT video_id, COUNT(*) AS plays
4
FROM   plays
5
WHERE  created_at > NOW() - INTERVAL '30 days'
6
GROUP  BY video_id;
7

8
CREATE UNIQUE INDEX ON video_play_counts(video_id);
9
REFRESH MATERIALIZED VIEW CONCURRENTLY video_play_counts;  -- every 5 min via cron

Readers get SELECT plays FROM video_play_counts WHERE video_id = ? — a single indexed row hit instead of an aggregation over billions of events. Trade 5-minute staleness for 3-order-of-magnitude speedup.

8. CQRS — two data models, one truth#

If your read shape fundamentally differs from your write shape, stop forcing a single schema. Command Query Responsibility Segregation (CQRS) splits them:

Write side — normalized, transactional, source of truth (Postgres, DynamoDB).
Read side — denormalized, eventually consistent, optimized for each query shape (Elasticsearch for search, Redis for hot paths, OLAP store for dashboards).
A CDC stream (Debezium, DynamoDB Streams, Kafka) propagates writes from source of truth to read models.

CQRS — one source of truth, many read models

flowchart LR
    U["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Writer<br/><span class='sub'>POST /order</span>"]:::client
    POSTGRES["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Postgres<br/><span class='sub'>source of truth</span>"]:::highlight
    CDC["{iconify:logos:apache-kafka} CDC stream<br/><span class='sub'>ex. Debezium</span>"]:::queue

    ES["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Elasticsearch<br/><span class='sub'>search</span>"]:::search
    REDIS["<img src="/icons/logos-redis.svg" alt="logos:redis" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Redis<br/><span class='sub'>hot lookups</span>"]:::cache
    OLAP["<img src="/icons/mdi-chart-line.svg" alt="mdi:chart-line" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> ClickHouse<br/><span class='sub'>analytics</span>"]:::storage

    R1["reader: search"]:::client
    R2["reader: profile"]:::client
    R3["reader: dashboard"]:::client

    U ==>|"POST /order"| POSTGRES
    POSTGRES ==>|"WAL changes"| CDC
    CDC --> ES
    CDC --> REDIS
    CDC --> OLAP

    R1 --> ES
    R2 --> REDIS
    R3 --> OLAP

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:2px,color:#4c1d95;

classDef client stroke:#64748b,stroke-width:1.75px;
classDef cache stroke:#10b981,stroke-width:1.75px;
classDef queue stroke:#a855f7,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef search stroke:#0ea5e9,stroke-width:1.75px;

Writes land in the transactional store; CDC fans changes into purpose-built read stores (search, hot lookups, OLAP) shaped for each query.

Real production uses:

Airbnb search — Postgres is the source of truth for listings; changes flow via Kafka → Elasticsearch. Search queries never touch Postgres directly.
Booking.com search — similar pattern at larger scale. The write path is a monolith MySQL cluster for inventory; the read path ships through a Kafka-backed pipeline into a custom Lucene-based index that serves hundreds of thousands of searches per second. Availability, not consistency, is the SLO — a booking attempt re-reads the master inventory for the final atomic “hold” step.
Uber trip lookups — write path is simple (MySQL); the read path is sharded by city-grid + aggregated into Redis and ELK for ops dashboards.
Netflix’s homepage — the aggregation layer (CDS) queries ~20 backend services and stitches the response. Each backend maintains its own read model; updates propagate via SQS/Kafka. Netflix famously runs an A/B test on every homepage render — the read path is essentially a CQRS read model materialized per user per experiment arm.
Hasaki payroll calculations — compute engine writes to Postgres; employee-facing dashboards hit a denormalized per-month Redis precompute, never the transactional tables.

Trade-offs:

Eventual consistency between write and read models. Enforce SLA (“read replica is ≤5s behind”) and surface stale-data indicators in the UI.
Operational complexity — more moving parts, more failure modes, more monitoring.
Schema evolution is harder — a breaking write-side change means rebuilding every read model.

When CQRS is overkill: CRUD apps, anything a single Postgres handles in <10 ms p99. When CQRS pays off: search, analytics, anything with a dominant read-shape that would make the write schema awful (e.g., full-text search on top of a normalized relational model).

9. Cheat sheet — which lever, in what order#

Which lever, in what order

flowchart TD
    Q["<img src="/icons/mdi-speedometer.svg" alt="mdi:speedometer" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Read latency / QPS hurts"]:::ask

    CHECK1{{"Slow query + pool<br/><span class='sub'>fixed?</span>"}}:::ask
    CHECK2{{"One writer + N readers<br/><span class='sub'>enough?</span>"}}:::ask
    CHECK3{{"Same answer<br/><span class='sub'>repeated often?</span>"}}:::ask
    CHECK4{{"One write → many reads<br/><span class='sub'>timeline / feed?</span>"}}:::ask
    CHECK5{{"Read shape ≠<br/><span class='sub'>write shape?</span>"}}:::ask

    TUNE["<img src="/icons/mdi-tune.svg" alt="mdi:tune" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Tune first<br/><span class='sub'>index + pool</span>"]:::step
    REP["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Read replicas<br/><span class='sub'>+ read-your-writes</span>"]:::ok
    CACHE["<img src="/icons/logos-redis.svg" alt="logos:redis" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Caching tier<br/><span class='sub'>CDN + Redis + LRU</span>"]:::ok
    FAN["<img src="/icons/mdi-call-split.svg" alt="mdi:call-split" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Fan-out hybrid<br/><span class='sub'>push + pull</span>"]:::warn
    CQRS["<img src="/icons/mdi-source-branch.svg" alt="mdi:source-branch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> CQRS + CDC<br/><span class='sub'>ES / Redis / OLAP</span>"]:::highlight

    Q --> CHECK1
    CHECK1 -->|"no"| TUNE
    CHECK1 -->|"yes"| CHECK2

    CHECK2 -->|"yes"| REP
    CHECK2 -->|"no — scaling"| CHECK3

    CHECK3 -->|"yes"| CACHE
    CHECK3 -->|"no"| CHECK4

    CHECK4 -->|"yes"| FAN
    CHECK4 -->|"no"| CHECK5

    CHECK5 -->|"yes"| CQRS
    CHECK5 -->|"no"| TUNE

    classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;

classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;

Walk down cheapest-first. Tune before replicating, cache before fanning out, fan out before CQRS. Jumping straight to the bottom is the most common interview mistake.

One-sentence summary per lever#

Lever	Pull when	Real system
Tune the query	Before everything else	Stack Overflow, 9 servers for 5B pageviews
Connection pool	App pods × open connections > DB max	PgBouncer in every serious Postgres deploy
Read replicas	Writes are fine, reads saturate CPU	GitHub MySQL, Shopify
CDN / edge cache	Same HTML/asset served to many users	Stack Overflow, every content site
App cache (Redis)	Same keyed answer served to many sessions	Facebook TAO, Netflix EVCache
Fan-out hybrid	One write amplifies to many users	Twitter, Instagram
Denormalize / MV	Aggregations over too many rows	Reddit hot feed, YouTube counters
Approximate counting	Exact count doesn’t matter	YouTube views, Instagram story views
CQRS + CDC	Read shape fundamentally ≠ write shape	Airbnb search, Uber trip lookup

Pitfalls to volunteer#

Reading the primary right after a write to a replica — every interviewer tests this. Solution: read-your-writes routing or a write-log cookie.
Cache stampede on a hot key — the Taylor Swift problem. Solution: singleflight, stale-while-revalidate, jittered TTLs, local LRU for celebrity keys.
Thundering herd on LB restart / cold cache — pre-warm caches on deploy; stagger deploys across pods.
Writing the count — any UPDATE counter SET n = n + 1 on a popular row is a single-row lock contention point. Use HLL, per-shard counters, or a write-behind with batched flush.
Over-fanning-out — a naive fan-out on write for celebrity accounts will OOM the writer. Always cap and flip to pull.
CQRS without monitoring replication lag — read models silently fall minutes behind; users file bugs about stale dashboards. Expose lag as a gauge and page when it exceeds SLA.
CDN caching personalized HTML — the cache key must include the user. Missing Vary: Cookie on a “logged in” version has leaked user data in several famous incidents.
Connection: close on every response — kills client-side pooling, pushes load back to the DB connection pool instead of HTTP.

Read scaling is a menu of levers, not a single technique. Good interview answers walk down the list: “I’d tune the slow query, add a pool, then look at replicas, then cache, and only then fan-out or CQRS.” That ordering — cheap fixes before expensive architecture — is what distinguishes a senior answer from a junior one.

10. Foundational reading#

Every lever in this post has an academic paper or canonical source behind it. Internalizing the primary sources is what separates a candidate who recites patterns from one who can derive them when the interviewer twists the problem.

On the fundamental trade-offs:

Brewer, E. (2000). Towards Robust Distributed Systems — the original CAP conjecture. (Keynote slides)
Gilbert, S. & Lynch, N. (2002). Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services — the formal proof.
Abadi, D. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story — introduces PACELC. (Paper)
Fischer, M., Lynch, N. & Paterson, M. (1985). Impossibility of Distributed Consensus with One Faulty Process — “FLP impossibility”; why consensus is hard.

On tail latency and modeling:

Dean, J. & Barroso, L. A. (2013). The Tail at Scale — ACM article that reframes how we think about p99 at scale. (PDF)
Little, J. D. C. (1961). A Proof for the Queuing Formula: L = λW — original queueing-theory result the whole concurrency industry rests on.
Gunther, N. (2007). Guerrilla Capacity Planning — the Universal Scalability Law. Companion to any capacity discussion.
Amdahl, G. (1967). Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities — Amdahl’s Law; why parallelization hits a ceiling.

On caching + storage:

Bloom, B. H. (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors — Bloom filter.
Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. (2007). HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm. (PDF)
Cormode, G. & Muthukrishnan, S. (2005). An Improved Data Stream Summary: The Count-Min Sketch and its Applications.
Megiddo, N. & Modha, D. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache — a defensible answer to “LRU or LFU?”

On replication and eventual consistency:

DeCandia, G. et al. (2007, Amazon). Dynamo: Amazon’s Highly Available Key-Value Store — the paper that popularized eventual consistency in industry.
Corbett, J. C. et al. (2013, Google). Spanner: Google’s Globally-Distributed Database — TrueTime, external consistency.
Lloyd, W. et al. (2011). Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS.

On the industry-standard textbook:

Kleppmann, M. Designing Data-Intensive Applications (O’Reilly, 2017). Chapter 5 (replication), Chapter 6 (partitioning), Chapter 11 (stream processing). Unofficial syllabus of every senior system-design interview.
Tanenbaum, A. & Van Steen, M. Distributed Systems: Principles and Paradigms (3rd ed., 2017). Longer, denser, academically thorough.

If you can cite even two of these in an interview — and explain why a lever you picked satisfies or violates the paper’s assumptions — the level the conversation shifts to is noticeably different.