Scaling Reads
Most systems are read-heavy. A social feed, a commerce catalog, a dashboard — the read
This post is the toolkit. Each section is a lever you can pull, with the trade-off that lever costs you and the production system that made it famous.
How to use this: the order is intentional — cheapest levers first. Pull them in order until the numbers stop hurting. Jumping straight to “shard everything + Kafka” without checking if a connection pool fix would solve it is the most common interview mistake.
1. Start with the math, not the architecture
Before any architecture talk, write the four numbers on the whiteboard:
QPS — queries per second, p50 and p99 peaklatency — what the user tolerates (e.g., 200 ms p99)payload — bytes per query × QPS = bandwidthratio — reads / writes (often 100:1 for feeds, 10,000:1 for catalogs)One insight that reframes every interview: a single write at the top of a fan-out chain can drive hundreds of reads downstream. Twitter’s 2010 stack serviced 300k tweets/day but 2B timeline reads/day — a 7000:1 amplification, entirely because one tweet showed up in thousands of followers’ timelines.
flowchart LR
W["<img src="/icons/mdi-send.svg" alt="mdi:send" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> 1 tweet<br/><span class='sub'>WRITE</span>"]:::warn
FOLLOW["<img src="/icons/mdi-account-multiple.svg" alt="mdi:account-multiple" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> 5,000 followers<br/><span class='sub'>fan-out</span>"]:::highlight
R1["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 1<br/><span class='sub'>read</span>"]:::ok
R2["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 2<br/><span class='sub'>read</span>"]:::ok
RDOTS["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> …<br/><span class='sub'>read</span>"]:::ok
R5K["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Timeline 5000<br/><span class='sub'>read</span>"]:::ok
W ==>|"1 write"| FOLLOW
FOLLOW --> R1
FOLLOW --> R2
FOLLOW --> RDOTS
FOLLOW --> R5K
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
Interview move: always state the read-amplification factor out loud before picking an architecture. It disciplines your thinking (“do I actually need to shard, or is this a cache problem?”) and impresses interviewers.
2. Pull the boring levers first
Before a single new box enters the architecture, three wins are hiding in plain sight:
Bigger primary. A Postgres box with 32 vCPU and 256 GB RAM handles tens of thousands of QPS if your working set fits in memory. Vertical scaling is mocked but it’s cheap — Stack Overflow famously runs on nine servers (2 SQL Servers, a handful of web, some Redis) serving 5 billion pageviews a year. The limit isn’t capacity, it’s the Universal Scalability Law (Gunther, 2007), which models how contention (α) and coherence cost (β) cause throughput to peak at a finite number of workers:
For most workloads, you hit the knee around 8–32 cores. Above that, the linear increase in contention eats the gains. That’s why “bigger box” is a real lever and has a ceiling.
Connection pooling — the Little’s Law corollary. A single Postgres connection is a dedicated OS process that reserves 5–10 MB of RAM plus a work_mem allocation per sort/hash and its share of shared buffers (see the pgBouncer FAQ and CitusData’s analysis). At 5,000 app pods each opening 10 connections, you’ve allocated 50,000 connections at the DB — more than the server can accept, and the RAM footprint alone could exceed your entire buffer cache. Put PgBouncer (or a JDBC pool, or AWS RDS Proxy) in front and cap concurrent active connections to 100–500.
The math is Little’s Law (Little, 1961): L = λ × W. If your arrival rate λ is 5,000 req/s and each query holds a connection for W = 20 ms, the average number L of busy connections is only 5000 × 0.02 = 100. You need 100 connections to serve this load, not 50,000. Every connection beyond the steady-state L wastes memory and file descriptors without helping throughput.
But — what if every connection in the pool is busy? This is the scenario real operators lose sleep over, because the failure mode cascades faster than most other outages. Picture a 100-connection pool at 100% utilization: query #101 has no connection to grab, and what happens next depends entirely on how your pool is configured.
sequenceDiagram
autonumber
participant C as Client
participant LB as Load Balancer
participant APP as App pod
participant POOL as Connection pool
participant DB as Postgres
C->>LB: request 101
LB->>APP: forward
APP->>POOL: acquire connection
Note over POOL: all 100 in use, block
rect rgb(254,243,199)
Note over POOL: waiting - max_wait 5s
end
rect rgb(254,226,226)
POOL-->>APP: timeout - PoolExhaustedException
APP-->>LB: HTTP 503 or 504 after 30s
Note over APP: app thread still held by earlier queued requests
end
rect rgb(254,226,226)
C->>LB: retry - and so do 1000 others
Note over LB,APP: retry storm amplifies load, pool stays exhausted longer
end
rect rgb(220,252,231)
Note over APP,DB: Mitigations - fail-fast pool timeout, circuit breaker, 503 Retry-After, drop requests at LB above 90 percent
endThe 5-stage cascade when you don’t defend against this:
- Queue at the pool. HikariCP / PgBouncer holds new requests up to
max_wait_time(default 30 s). During this window, app threads are pinned waiting for connections — memory and file descriptors stay allocated. - App worker starvation. Every queued request consumes a thread. If your app serves 1,000 req/s and each blocks 5 s waiting, you need 5,000 threads — most servers topple at a few hundred.
- Load-balancer backup. The LB’s keep-alive queue to the app fills. TCP accept-queues overflow. New clients get
connection refusedbefore they even reach the app. - Client retry storm. Every failed request retries, often exponentially. A 10% error rate turns into 30% within seconds because retries add to the next wave of traffic.
- The DB never recovers on its own — even if downstream latency fixes itself, the queued requests keep the pool saturated until either clients give up or someone drains the queue.
Defensive patterns that actually work:
- Fail fast, not forever. Set
pool_timeout = 500ms, not 30s. Return 503 early so the client can give up before its own timeout. Retry-After+ jitter. When returning 503, includeRetry-After: 5so well-behaved clients back off. Clients that ignore it (usually internal services) need their own retry budget with jitter.- Reject at the LB. Nginx
limit_conn/ HAProxymaxconnrejects new connections when the upstream is saturated, preventing the LB itself from queueing. - Circuit breaker on the caller side. When 50% of requests to a downstream fail with pool timeouts, open the breaker and stop calling for 30 s (see Networking Essentials §6).
- Monitor
pool.waitCount+pool.waitDuration. These are leading indicators. Alert at 90% utilization, not at exhaustion.
Real incident worth memorizing — Heroku, Oct 2021. During a planned Postgres follower failover, thousands of tenants’ PgBouncer instances hit connection pool exhaustion simultaneously. Clients retried, retries amplified demand, and the incident lasted ~90 minutes across the platform. Postmortem: fail-fast timeouts were set too generously (30s); the default let queuing consume app threads faster than the failover could complete. Heroku’s fix: tighten default timeouts + surface pool.waitCount in tenant dashboards. Status post for reference.
Make the slow queries fast. Pull pg_stat_statements, sort by total_exec_time, look at the top 10. Often 1–2 queries missing an index consume half the database CPU. Fixing those buys months of runway before any replica or cache is needed (see §3-4 of Database Indexing).
3. Read replicas — the first real scaling lever
Once a single primary saturates, split reads and writes. Writes go to the primary; reads round-robin across N read replicas that stream the WAL from the primary.
flowchart LR
APP["<img src="/icons/mdi-server.svg" alt="mdi:server" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> application pods<br/><span class='sub'>stateless</span>"]:::client
subgraph DB ["Database tier"]
direction TB
PRI["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Primary<br/><span class='sub'>writes + reads</span>"]:::highlight
R1["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 1<br/><span class='sub'>read-only</span>"]:::database
R2["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 2<br/><span class='sub'>read-only</span>"]:::database
R3["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Replica 3<br/><span class='sub'>read-only</span>"]:::database
PRI -.->|"WAL stream (async)"| R1
PRI -.->|"WAL stream"| R2
PRI -.->|"WAL stream"| R3
end
APP ==>|"INSERT / UPDATE"| PRI
APP -->|"SELECT"| R1
APP -->|"SELECT"| R2
APP -->|"SELECT"| R3
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:2px,color:#4c1d95;
classDef clusterOrange fill:transparent,stroke:#f97316,stroke-dasharray:6 4
class DB clusterOrange
classDef client stroke:#64748b,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;
Trade-off — the CAP and PACELC framing. Replicas make explicit a trade-off that’s always there, just hidden on a single node: during any network delay you can have strong consistency or availability of every replica, not both (CAP theorem, Brewer 2000; formally proved by Gilbert & Lynch, 2002). Real systems extend this via PACELC (Abadi 2012): during a partition (P) pick A vs C; else (E) pick L (low latency) vs C (strong consistency). Postgres streaming replication is PA/EL by default — it sacrifices consistency during partitions and for latency when things are healthy, in exchange for availability and speed. Switching to synchronous_commit = remote_apply flips it to PC/EC — available only when a quorum of replicas is up.
Concretely — replication lag. A write commits on the primary; the WAL takes 10–500 ms to ship and apply. If your next read lands on a replica that hasn’t caught up, you see stale data — the classic “I just posted a comment and it disappeared” bug. This isn’t a Postgres quirk; it’s the A in PACELC manifesting on every read.
Production patterns for read-your-writes:
- Sticky routing for a session window — after a user writes, route their next N seconds of reads back to the primary. GitHub and Shopify both do this.
- Write-log cookie — the app gets the primary’s log position (LSN in Postgres) after a write; subsequent reads carry the LSN and the query router waits for a replica to catch up or falls back to the primary.
- Synchronous replicas for critical paths —
synchronous_commit = remote_applymakes the primary block until at least one replica has applied the WAL. Durability win, p99 cost. Financial systems do this; social networks don’t.
Vitess (at YouTube, Slack, GitHub) is the industry-standard router for this pattern: it speaks the MySQL wire protocol, routes queries by rule, and handles connection pooling + read-your-writes in one box.
LinkedIn’s Espresso takes this even further. Their feed and profile services run on a custom NoSQL store on top of MySQL where the read path is split into “read-from-local-region” (eventually consistent) and “read-from-master” (strongly consistent). Applications opt into the consistency level per-endpoint: a profile view is eventually consistent; confirming a just-sent connection request reads from master. Same pattern, just exposed as a first-class API.
When replicas stop helping: write throughput now caps you. Replicas all replay the same WAL, so adding more replicas speeds reads but not writes. At that point you’re sharding (see §8 and my forthcoming sharding post).
4. Caching — the other 10× lever
A cache hit is an index lookup in RAM. A DB query is (best case) an index lookup that traverses to a heap page on SSD. The speed ratio is ~100:1 — and the reason every latency number matters is “The Tail at Scale” (Dean & Barroso, 2013). Their central result: when a single page requires 100 parallel backend calls, a 99th-percentile backend latency of 100 ms becomes the 63rd percentile page latency. Tail latency, not mean, dominates user-visible performance. Caches attack this by eliminating the bottom-percentile DB round-trips almost entirely.
Latency numbers every engineer should know
Adapted from Jeff Dean’s canonical table. Internalizing these is how you reason about cache design:
| Operation | Time | Analogy (1 ns = 1 s) |
|---|---|---|
| L1 cache reference | 0.5 ns | 0.5 s |
| L2 cache reference | 7 ns | 7 s |
| Main memory reference | 100 ns | 2 min |
| Compress 1 KB (snappy) | 3 µs | 1 hour |
| Send 1 KB over 1 Gbps | 10 µs | 3 hours |
| SSD random read | 150 µs | 2 days |
| Read 1 MB sequentially from memory | 250 µs | 3 days |
| Round trip within a datacenter | 500 µs | 6 days |
| Read 1 MB sequentially from SSD | 1 ms | 12 days |
| Disk seek (rotational) | 10 ms | 4 months |
| Read 1 MB sequentially from disk | 20 ms | 8 months |
| Cross-continent round trip | 150 ms | 5 years |
The takeaway: in-memory cache is ~1,500× faster than SSD random reads; same-DC RTT is roughly one SSD seek. When you trade a network hop for avoiding a DB query, the math almost always wins.
Real systems layer caches; each layer catches what the layer below missed:
flowchart TD
USER["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> User"]:::client
CDN["① CDN / Edge<br/><span class='sub'>ex. Cloudflare</span>"]:::edge
EDGE["② Regional<br/><span class='sub'>ex. EVCache</span>"]:::cache
APPC["③ Application<br/><span class='sub'>ex. Redis</span>"]:::cache
LOCAL["④ In-process<br/><span class='sub'>ex. Caffeine</span>"]:::cache
DB["⑤ Database<br/><span class='sub'>source of truth</span>"]:::database
USER -->|"90-99% hit"| CDN
CDN -->|"on miss"| EDGE
EDGE -->|"on miss"| APPC
APPC -->|"on miss"| LOCAL
LOCAL -->|"on miss"| DB
DB -.->|"backfill on read"| APPC
classDef client stroke:#64748b,stroke-width:1.75px;
classDef edge stroke:#3b82f6,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;
classDef cache stroke:#10b981,stroke-width:1.75px;
Case studies per layer
Facebook TAO (③ application cache) — Facebook’s entire social graph reads go through TAO, a memcached-based read-through cache in front of MySQL. At peak, TAO serves billions of reads per second with a 99%+ hit ratio; only the 1% miss ever touches MySQL. The social graph would be physically impossible to serve without this layer.
Netflix EVCache (② regional) — Memcached clusters replicated across AZs. The Netflix homepage for a single user aggregates results from 20+ microservices; EVCache holds each microservice’s result per-user so the homepage builds in <100 ms even when individual services are slow. At peak it serves tens of millions of ops/sec per region.
Cloudflare / Fastly (① CDN) — Stack Overflow caches their rendered HTML at the edge with a 1-minute TTL + stale-while-revalidate. The result: the primary origin sees ~5% of total traffic. Every question page and tag feed is served from the CDN unless it’s been edited.
In-process LRU (④) — commonly forgotten. A 100 MB Caffeine cache inside every pod catches the top ~1000 hot keys with zero network hop. Used heavily at Netflix (Hystrix) and Google (Guava Cache). Trade-off: consistency across pods is loose — each pod has its own copy; invalidation is harder.
Wikipedia’s Varnish + Memcached stack is the textbook CDN-plus-object-cache architecture. Every article page first tries Varnish (edge HTTP cache, 5-minute TTL) for unauthenticated readers — that layer alone serves over 90% of Wikipedia’s page views. Misses hit MediaWiki, which checks Memcached for parsed page fragments before rendering from MariaDB. The database sees only a trickle of actual reads; for a site hosting tens of billions of monthly views, this is how the operation stays affordable on donations. (Wikimedia infrastructure docs.)
Shopify Black Friday stresses every layer simultaneously. Their team pre-warms the CDN edges and Redis hot-key caches for the top 10k merchants the week before peak, simulates traffic in “Shopify Resiliency Engineering” game days, and monitors cache hit ratios as a first-class SLO. During the 2023 BFCM weekend they served 75.5 million orders and $9.3 B in sales without a primary-database scale event — almost entirely because the read tier absorbed the spike.
When caches go wrong
A cache is a distributed copy of your database that will, eventually, disagree with it. The five failure modes the interviewer wants to hear:
- Stale reads — TTL expired but data changed mid-window. Accept for feeds; fatal for balances.
- Cache stampede (thundering herd) — a popular key expires; 10,000 clients all miss and hit the DB simultaneously. Fix: request coalescing (singleflight), stale-while-revalidate, or add a small random jitter to TTLs.
- Hot-key overload — one celebrity account’s profile is 5% of global traffic; the single Redis shard holding it melts. Fix: client-side shadow replicas of hot keys, or a local LRU layer in front.
- Cache penetration — a bad actor queries keys that don’t exist; every miss hits the DB. Fix: cache negative responses with a shorter TTL, or a bloom filter at the cache layer.
- Cache avalanche — power cycle + all TTLs expire simultaneously. Fix: warm the cache before serving traffic, staggered TTLs.
5. Cache invalidation — pick your poison
The hard part isn’t caching; it’s knowing when the cached value is wrong. Four canonical strategies, each with a production champion:
flowchart TD
Q["<img src="/icons/mdi-help-circle.svg" alt="mdi:help-circle" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> New cached value<br/><span class='sub'>which strategy?</span>"]:::ask
TTL["<img src="/icons/mdi-timer-outline.svg" alt="mdi:timer-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> TTL<br/><span class='sub'>value + expiry</span>"]:::step
WT["<img src="/icons/mdi-database-arrow-right.svg" alt="mdi:database-arrow-right" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Write-through<br/><span class='sub'>cache + DB, one txn</span>"]:::step
WB["<img src="/icons/mdi-database-clock.svg" alt="mdi:database-clock" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Write-behind<br/><span class='sub'>DB catches up async</span>"]:::step
INV["<img src="/icons/mdi-bell-ring.svg" alt="mdi:bell-ring" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Event-driven<br/><span class='sub'>subscribers invalidate</span>"]:::step
Q --> TTL
Q --> WT
Q --> WB
Q --> INV
TTL -.->|"simple, slightly stale"| OUT1["<img src="/icons/mdi-view-dashboard.svg" alt="mdi:view-dashboard" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Dashboards<br/><span class='sub'>read-heavy</span>"]:::ok
WT -.->|"strict, slower writes"| OUT2["<img src="/icons/mdi-cart.svg" alt="mdi:cart" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Pinterest, Shopify<br/><span class='sub'>board / cart</span>"]:::warn
WB -.->|"fast writes, crash loss"| OUT3["<img src="/icons/mdi-chart-line.svg" alt="mdi:chart-line" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Metrics<br/><span class='sub'>analytics</span>"]:::warn
INV -.->|"strong + scalable, complex"| OUT4["<img src="/icons/mdi-car.svg" alt="mdi:car" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Uber, Facebook TAO<br/><span class='sub'>trip / graph</span>"]:::ok
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
TTL — simplest. SET key "..." EX 60. Works for 80% of cases. Accept 60 seconds of staleness; gain infinite simplicity.
Write-through — write hits both cache and DB in one transaction. Pinterest uses this for board state: you pin a photo, the cached board view updates before the HTTP response returns. Cost: every write is now 2× slower.
Write-behind (write-back) — write to cache first, return 200, sync to DB asynchronously. Metrics collectors (Datadog, Prometheus) do this for throughput; accept losing the last second of data on crash.
Event-driven invalidation — the DB (or its CDC stream) emits a changed(key=X) event; cache subscribers purge that key. Uber invalidates trip-state caches this way: a driver status update goes to Kafka, cache invalidators consume it and purge the relevant keys within ~100 ms. Facebook TAO uses a similar pattern for social graph edges. Most scalable and most complex — you need an event bus, consumers, and careful ordering.
6. Fan-out — the classic timeline problem
When one action (a post, a like, a purchase) must appear in many users’ personalized views, you face the timeline problem. Two architectures:
flowchart LR
subgraph FOW ["① Fan-out on write (push)"]
direction TB
P1["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::warn
W1["<img src="/icons/mdi-cog.svg" alt="mdi:cog" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Writer<br/><span class='sub'>fan-out</span>"]:::compute
INBOX1["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob's timeline"]:::storage
INBOX2["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Carol's timeline"]:::storage
INBOXN["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> … 5M timelines"]:::storage
P1 --> W1
W1 --> INBOX1
W1 --> INBOX2
W1 --> INBOXN
end
subgraph FOR ["② Fan-out on read (pull)"]
direction TB
P2["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::ok
DB["<img src="/icons/mdi-database-outline.svg" alt="mdi:database-outline" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> posts table<br/><span class='sub'>one row</span>"]:::database
READ["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob opens app<br/><span class='sub'>merge + sort</span>"]:::ok
P2 --> DB
READ --> DB
end
classDef clusterRed fill:transparent,stroke:#dc2626,stroke-dasharray:6 4
classDef clusterGreen fill:transparent,stroke:#16a34a,stroke-dasharray:6 4
class FOW clusterRed
class FOR clusterGreen
classDef compute stroke:#7c3aed,stroke-width:1.75px;
classDef database stroke:#f97316,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
| Fan-out on write | Fan-out on read | |
|---|---|---|
| Write cost | O(followers) — massive | O(1) |
| Read cost | O(1) — timeline is precomputed | O(people you follow) — scatter-gather |
| Storage | N × followers × posts | N × posts |
| Wins when | Most users have few followers, reads far exceed writes | Users follow few accounts; write amplification infeasible |
Twitter (2013-ish) — famously hit a wall on pure fan-out on write. A tweet from Taylor Swift (100M followers) wrote 100M inbox entries. Their answer:
The hybrid — fan-out on write + fan-out on read for celebrities.
flowchart TD
POST["<img src="/icons/mdi-pencil.svg" alt="mdi:pencil" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Alice posts"]:::step
CHECK{{"Celebrity?<br/><span class='sub'>>1M followers</span>"}}:::ask
WRITE["<img src="/icons/mdi-account-multiple.svg" alt="mdi:account-multiple" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Fan-out on write<br/><span class='sub'>push to inboxes</span>"]:::ok
NOWRITE["<img src="/icons/mdi-close-octagon.svg" alt="mdi:close-octagon" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Skip fan-out<br/><span class='sub'>too expensive</span>"]:::warn
READ["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Bob opens app"]:::step
INBOX["<img src="/icons/mdi-inbox.svg" alt="mdi:inbox" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Precomputed timeline<br/><span class='sub'>regular posts</span>"]:::ok
MERGE["<img src="/icons/mdi-merge.svg" alt="mdi:merge" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Merge celebrity posts<br/><span class='sub'>fetched on read</span>"]:::highlight
TIMELINE["<img src="/icons/mdi-cellphone.svg" alt="mdi:cellphone" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Final timeline"]:::step
POST --> CHECK
CHECK -->|"no (regular)"| WRITE
CHECK -->|"yes (celebrity)"| NOWRITE
READ --> INBOX
READ --> MERGE
INBOX --> TIMELINE
MERGE --> TIMELINE
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
Real-world hybrid examples:
- Instagram — identical strategy; the “Explore” feed is also fan-out on read over ranked candidate posts.
- LinkedIn feed — fan-out on write to a per-user “follow feed” using Kafka + Samza stream processing. When a user posts, Samza fans the event out to follower inboxes within seconds. Ranking and personalization happen at read time on top of the fanned-out candidate set. Scale: billions of feed updates per day across hundreds of millions of members.
- Discord channel messages — fan-out on read (one channel, many subscribers). Scales because channels are small (usually <1k active members) and Cassandra/ScyllaDB handles the per-channel partition cheaply.
- Slack — fan-out on write for mentions (push notifications + badge counts), fan-out on read for channel history. Hybrid chosen per message type; mentions rarely exceed a few dozen recipients, channel scrollback can be thousands.
- Tinder’s Swipe API — a specialized variant. Each user’s “deck” is precomputed (fan-out on write when new candidates match filters), stored in Redis as a sorted set. Swiping pops the head of the list and pushes telemetry into Kafka for re-ranking. This is fan-out-on-write applied to a personalized queue rather than a social feed.
The interview answer: “I’d default to fan-out on write for the normal case and switch to fan-out on read for high-fanout outliers. I’d cap the fan-out write to ~10k followers and mark accounts above that as ‘pull’ so the math stays bounded.”
7. Denormalization, materialized views, and approximation
When you can’t shrink the read load, move the work to write time.
Precomputed read models
Reddit’s hot feed — Reddit maintains sorted sets in Redis keyed by subreddit, with score = hotness function (log10(upvotes) + age_factor). Every upvote updates the sorted set score. Rendering the hot page is ZREVRANGE subreddit:pics 0 25 — a single 1 ms Redis call, not a DB aggregation.
YouTube’s watch history — every play writes to a per-user materialized list (“recently watched”), kept in Bigtable with a TTL. The alternative, aggregating play events from an event stream at read time, would take seconds per user.
Approximate counters — trade accuracy for memory
Exact counts on popular objects are an anti-pattern — every view or like becomes a DB write. The academic answer is a family of probabilistic data structures that compress unbounded sets into fixed-size sketches with provable error bounds.
- HyperLogLog (Flajolet, Fusy, Gandouet, Meunier, 2007) — counts unique elements from a stream with standard error
1.04 / √mwheremis the register count. The defaultm = 2^14buckets (16 KB) gives ≈0.8% error on cardinalities up to 10⁹. YouTube’s billion-view counters, Redis’sPFCOUNT, and Google BigQuery’sAPPROX_COUNT_DISTINCTare all HLL. - Bloom filters (Bloom, 1970) — test set membership with zero false negatives and bounded false positives. Given
nitems,mbits, andkhash functions, false-positive rate is(1 − e^(−kn/m))^k, minimized whenk ≈ (m/n) ln 2. Cassandra uses a bloom filter per SSTable to skip disk reads; Medium uses them to detect articles you’ve already read. - Count-min sketch (Cormode & Muthukrishnan, 2005) — approximate frequency counts in sub-linear space; what Redis uses internally for its
CMSKETCHcommands and what ad-tech uses for fraud counters.
Practical examples:
- YouTube view count — for videos under ~1000 views, exact. Above that, HyperLogLog estimates ±2% from a tiny (<1 KB) sketch. The homepage shows “1.2M views” — nobody cares that it’s actually 1,199,834.
- Twitter like counts — buffered in Redis with periodic flush; exact within the last few seconds at most.
- Instagram story views — recent activity is exact, historical is summarized.
The general principle: a deterministic query over billions of events is linear work. A sketch is constant-space, sub-linear update — you pay O(1) per insert and O(1) per query, accepting bounded error. For read-heavy analytics where “close enough” is fine, this is how you escape the write-amplification trap entirely.
-- Instead of O(plays) sums every request, precompute:CREATE MATERIALIZED VIEW video_play_counts ASSELECT video_id, COUNT(*) AS playsFROM playsWHERE created_at > NOW() - INTERVAL '30 days'GROUP BY video_id;
CREATE UNIQUE INDEX ON video_play_counts(video_id);REFRESH MATERIALIZED VIEW CONCURRENTLY video_play_counts; -- every 5 min via cronReaders get SELECT plays FROM video_play_counts WHERE video_id = ? — a single indexed row hit instead of an aggregation over billions of events. Trade 5-minute staleness for 3-order-of-magnitude speedup.
8. CQRS — two data models, one truth
If your read shape fundamentally differs from your write shape, stop forcing a single schema. Command Query Responsibility Segregation (CQRS) splits them:
- Write side — normalized, transactional, source of truth (Postgres, DynamoDB).
- Read side — denormalized, eventually consistent, optimized for each query shape (Elasticsearch for search, Redis for hot paths, OLAP store for dashboards).
- A CDC stream (Debezium, DynamoDB Streams, Kafka) propagates writes from source of truth to read models.
flowchart LR
U["<img src="/icons/mdi-account.svg" alt="mdi:account" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Writer<br/><span class='sub'>POST /order</span>"]:::client
POSTGRES["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Postgres<br/><span class='sub'>source of truth</span>"]:::highlight
CDC["{iconify:logos:apache-kafka} CDC stream<br/><span class='sub'>ex. Debezium</span>"]:::queue
ES["<img src="/icons/logos-elasticsearch.svg" alt="logos:elasticsearch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Elasticsearch<br/><span class='sub'>search</span>"]:::search
REDIS["<img src="/icons/logos-redis.svg" alt="logos:redis" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Redis<br/><span class='sub'>hot lookups</span>"]:::cache
OLAP["<img src="/icons/mdi-chart-line.svg" alt="mdi:chart-line" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> ClickHouse<br/><span class='sub'>analytics</span>"]:::storage
R1["reader: search"]:::client
R2["reader: profile"]:::client
R3["reader: dashboard"]:::client
U ==>|"POST /order"| POSTGRES
POSTGRES ==>|"WAL changes"| CDC
CDC --> ES
CDC --> REDIS
CDC --> OLAP
R1 --> ES
R2 --> REDIS
R3 --> OLAP
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:2px,color:#4c1d95;
classDef client stroke:#64748b,stroke-width:1.75px;
classDef cache stroke:#10b981,stroke-width:1.75px;
classDef queue stroke:#a855f7,stroke-width:1.75px;
classDef storage stroke:#f59e0b,stroke-width:1.75px;
classDef search stroke:#0ea5e9,stroke-width:1.75px;
Real production uses:
- Airbnb search — Postgres is the source of truth for listings; changes flow via Kafka → Elasticsearch. Search queries never touch Postgres directly.
- Booking.com search — similar pattern at larger scale. The write path is a monolith MySQL cluster for inventory; the read path ships through a Kafka-backed pipeline into a custom Lucene-based index that serves hundreds of thousands of searches per second. Availability, not consistency, is the SLO — a booking attempt re-reads the master inventory for the final atomic “hold” step.
- Uber trip lookups — write path is simple (MySQL); the read path is sharded by city-grid + aggregated into Redis and ELK for ops dashboards.
- Netflix’s homepage — the aggregation layer (CDS) queries ~20 backend services and stitches the response. Each backend maintains its own read model; updates propagate via SQS/Kafka. Netflix famously runs an A/B test on every homepage render — the read path is essentially a CQRS read model materialized per user per experiment arm.
- Hasaki payroll calculations — compute engine writes to Postgres; employee-facing dashboards hit a denormalized per-month Redis precompute, never the transactional tables.
Trade-offs:
- Eventual consistency between write and read models. Enforce SLA (“read replica is ≤5s behind”) and surface stale-data indicators in the UI.
- Operational complexity — more moving parts, more failure modes, more monitoring.
- Schema evolution is harder — a breaking write-side change means rebuilding every read model.
When CQRS is overkill: CRUD apps, anything a single Postgres handles in <10 ms p99. When CQRS pays off: search, analytics, anything with a dominant read-shape that would make the write schema awful (e.g., full-text search on top of a normalized relational model).
9. Cheat sheet — which lever, in what order
flowchart TD
Q["<img src="/icons/mdi-speedometer.svg" alt="mdi:speedometer" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Read latency / QPS hurts"]:::ask
CHECK1{{"Slow query + pool<br/><span class='sub'>fixed?</span>"}}:::ask
CHECK2{{"One writer + N readers<br/><span class='sub'>enough?</span>"}}:::ask
CHECK3{{"Same answer<br/><span class='sub'>repeated often?</span>"}}:::ask
CHECK4{{"One write → many reads<br/><span class='sub'>timeline / feed?</span>"}}:::ask
CHECK5{{"Read shape ≠<br/><span class='sub'>write shape?</span>"}}:::ask
TUNE["<img src="/icons/mdi-tune.svg" alt="mdi:tune" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Tune first<br/><span class='sub'>index + pool</span>"]:::step
REP["<img src="/icons/logos-postgresql.svg" alt="logos:postgresql" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Read replicas<br/><span class='sub'>+ read-your-writes</span>"]:::ok
CACHE["<img src="/icons/logos-redis.svg" alt="logos:redis" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Caching tier<br/><span class='sub'>CDN + Redis + LRU</span>"]:::ok
FAN["<img src="/icons/mdi-call-split.svg" alt="mdi:call-split" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> Fan-out hybrid<br/><span class='sub'>push + pull</span>"]:::warn
CQRS["<img src="/icons/mdi-source-branch.svg" alt="mdi:source-branch" class="diagram-node-icon" width="24" height="24" style="display:block;margin:0 auto 4px;" /> CQRS + CDC<br/><span class='sub'>ES / Redis / OLAP</span>"]:::highlight
Q --> CHECK1
CHECK1 -->|"no"| TUNE
CHECK1 -->|"yes"| CHECK2
CHECK2 -->|"yes"| REP
CHECK2 -->|"no — scaling"| CHECK3
CHECK3 -->|"yes"| CACHE
CHECK3 -->|"no"| CHECK4
CHECK4 -->|"yes"| FAN
CHECK4 -->|"no"| CHECK5
CHECK5 -->|"yes"| CQRS
CHECK5 -->|"no"| TUNE
classDef highlight fill:#f3ebff,stroke:#7c3aed,stroke-width:1.5px,color:#4c1d95;
classDef ok stroke:#16a34a,stroke-width:1.75px;
classDef warn stroke:#d97706,stroke-width:1.75px;
classDef step stroke:#3b6fd6,stroke-width:1.75px;
classDef ask stroke:#6366f1,stroke-width:1.75px;
One-sentence summary per lever
| Lever | Pull when | Real system |
|---|---|---|
| Tune the query | Before everything else | Stack Overflow, 9 servers for 5B pageviews |
| Connection pool | App pods × open connections > DB max | PgBouncer in every serious Postgres deploy |
| Read replicas | Writes are fine, reads saturate CPU | GitHub MySQL, Shopify |
| CDN / edge cache | Same HTML/asset served to many users | Stack Overflow, every content site |
| App cache (Redis) | Same keyed answer served to many sessions | Facebook TAO, Netflix EVCache |
| Fan-out hybrid | One write amplifies to many users | Twitter, Instagram |
| Denormalize / MV | Aggregations over too many rows | Reddit hot feed, YouTube counters |
| Approximate counting | Exact count doesn’t matter | YouTube views, Instagram story views |
| CQRS + CDC | Read shape fundamentally ≠ write shape | Airbnb search, Uber trip lookup |
Pitfalls to volunteer
- Reading the primary right after a write to a replica — every interviewer tests this. Solution: read-your-writes routing or a write-log cookie.
- Cache stampede on a hot key — the Taylor Swift problem. Solution: singleflight, stale-while-revalidate, jittered TTLs, local LRU for celebrity keys.
- Thundering herd on LB restart / cold cache — pre-warm caches on deploy; stagger deploys across pods.
- Writing the count — any
UPDATE counter SET n = n + 1on a popular row is a single-row lock contention point. Use HLL, per-shard counters, or a write-behind with batched flush. - Over-fanning-out — a naive fan-out on write for celebrity accounts will OOM the writer. Always cap and flip to pull.
- CQRS without monitoring replication lag — read models silently fall minutes behind; users file bugs about stale dashboards. Expose lag as a gauge and page when it exceeds SLA.
- CDN caching personalized HTML — the cache key must include the user. Missing
Vary: Cookieon a “logged in” version has leaked user data in several famous incidents. Connection: closeon every response — kills client-side pooling, pushes load back to the DB connection pool instead of HTTP.
Read scaling is a menu of levers, not a single technique. Good interview answers walk down the list: “I’d tune the slow query, add a pool, then look at replicas, then cache, and only then fan-out or CQRS.” That ordering — cheap fixes before expensive architecture — is what distinguishes a senior answer from a junior one.
10. Foundational reading
Every lever in this post has an academic paper or canonical source behind it. Internalizing the primary sources is what separates a candidate who recites patterns from one who can derive them when the interviewer twists the problem.
On the fundamental trade-offs:
- Brewer, E. (2000). Towards Robust Distributed Systems — the original CAP conjecture. (Keynote slides)
- Gilbert, S. & Lynch, N. (2002). Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services — the formal proof.
- Abadi, D. (2012). Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story — introduces PACELC. (Paper)
- Fischer, M., Lynch, N. & Paterson, M. (1985). Impossibility of Distributed Consensus with One Faulty Process — “FLP impossibility”; why consensus is hard.
On tail latency and modeling:
- Dean, J. & Barroso, L. A. (2013). The Tail at Scale — ACM article that reframes how we think about p99 at scale. (PDF)
- Little, J. D. C. (1961). A Proof for the Queuing Formula: L = λW — original queueing-theory result the whole concurrency industry rests on.
- Gunther, N. (2007). Guerrilla Capacity Planning — the Universal Scalability Law. Companion to any capacity discussion.
- Amdahl, G. (1967). Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities — Amdahl’s Law; why parallelization hits a ceiling.
On caching + storage:
- Bloom, B. H. (1970). Space/Time Trade-offs in Hash Coding with Allowable Errors — Bloom filter.
- Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. (2007). HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm. (PDF)
- Cormode, G. & Muthukrishnan, S. (2005). An Improved Data Stream Summary: The Count-Min Sketch and its Applications.
- Megiddo, N. & Modha, D. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache — a defensible answer to “LRU or LFU?”
On replication and eventual consistency:
- DeCandia, G. et al. (2007, Amazon). Dynamo: Amazon’s Highly Available Key-Value Store — the paper that popularized eventual consistency in industry.
- Corbett, J. C. et al. (2013, Google). Spanner: Google’s Globally-Distributed Database — TrueTime, external consistency.
- Lloyd, W. et al. (2011). Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS.
On the industry-standard textbook:
- Kleppmann, M. Designing Data-Intensive Applications (O’Reilly, 2017). Chapter 5 (replication), Chapter 6 (partitioning), Chapter 11 (stream processing). Unofficial syllabus of every senior system-design interview.
- Tanenbaum, A. & Van Steen, M. Distributed Systems: Principles and Paradigms (3rd ed., 2017). Longer, denser, academically thorough.
If you can cite even two of these in an interview — and explain why a lever you picked satisfies or violates the paper’s assumptions — the level the conversation shifts to is noticeably different.