1: Scale from Zero to Millions of Users
Load Balancer
A load balancer is a system that distributes incoming network or application traffic across multiple backend servers. Its primary purpose is to improve availability, reliability, and performance by preventing any single server from becoming a bottleneck or point of failure.
Core Components
- Frontend System
- Backend Pool
- Health Checks
- Routing Logic
Routing Logic
- Round Robin
- Least Connections
- Weighted Distribution
- IP Hash
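As a sketch, the routing strategies above can be expressed in a few lines of Python (the server addresses and connection-count bookkeeping are illustrative assumptions, not part of these notes):

```python
import itertools

# Hypothetical backend pool; addresses are illustrative only.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Round Robin: cycle through servers in order.
rr = itertools.cycle(servers)

def round_robin():
    return next(rr)

# Least Connections: pick the server with the fewest active connections.
active = {s: 0 for s in servers}

def least_connections():
    target = min(active, key=active.get)
    active[target] += 1  # caller would decrement when the connection closes
    return target

# IP Hash: the same client IP always maps to the same server,
# which gives a crude form of session stickiness.
def ip_hash(client_ip):
    return servers[hash(client_ip) % len(servers)]
```

Weighted distribution is usually a variation of round robin where each server appears in the rotation in proportion to its assigned weight.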
“Any server should be able to handle any request independently”
Stateless vs Stateful Sessions
Why is Stateless Backend Preferred?
- Horizontal Scalability
- Simpler Load Balancing
- Fault Tolerance
- Easier Deployments
What does Stateless mean?
- Server does not store client-specific state between requests
- State is instead stored in shared systems:
- Database
- Cache
- Object Storage
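A minimal sketch of the stateless pattern, using a plain dict as a stand-in for a shared store such as Redis (all names here are illustrative):

```python
# Session state lives in a shared store, not on any one server,
# so any server can serve any request.
shared_store = {}  # stand-in for Redis, a database, etc.

def handle_request(server_id, session_id, data=None):
    """Any server can handle the request because state is external."""
    if data is not None:
        shared_store[session_id] = data             # write state externally
    return server_id, shared_store.get(session_id)  # no local state used

# Request 1 lands on server A, request 2 on server B -- same session works.
handle_request("server-A", "sess-42", {"user": "alice"})
_, state = handle_request("server-B", "sess-42")
```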
When are Stateful Backends Used?
- Real-time Systems
- WebSockets
- Multiplayer Games
- Live Collaboration Tools
- High Performance Caching
- Reduces Latency
- Avoids repeated database hits
- Legacy Systems
- Specialized workloads
- Long-running processes
- In-memory computation engines
Subtle Tradeoff
Stateless systems shift complexity rather than eliminate it. Instead of:
- Managing in-memory sessions
You now manage:
- Distributed consistency
- Cache invalidation
- Network latency to state stores
So the system becomes:
- Operationally simpler at the server level
- Architecturally more distributed overall
Use stateless servers unless you have a clear reason not to.
Database Replication
- Usually uses a master/slave relationship between the original (master) and the copies (slaves).
- A master database generally only supports write operations. A slave database gets the copies of the data from the master database and only supports read operations.
- Most applications require a much higher ratio of reads to writes; thus, the number of slave databases is usually larger than the number of master databases.
- Advantages
- Better performance: It allows more queries to be processed in parallel
- Reliability: If one of the database servers is destroyed, data is still preserved
- High Availability: Even if a database goes offline, you can access data stored in another database server
- What if a database goes offline?
- If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave will replace the old one. If multiple slaves are available, the read operations are redirected to a healthy slave.
- If the master goes offline, a slave will be promoted to be the new master and a new slave will replace the old one for data replication immediately.
- In production, promoting a new master is more complicated as the data in a slave database might not be up to date. The missing data needs to be filled in by running recovery scripts. Other replication methods such as multi-master and circular replication could help.
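The read/write split and failover behavior above can be sketched in application-level routing logic (the database names and health-tracking mechanism are illustrative assumptions):

```python
import random

master = "db-master"
slaves = ["db-slave-1", "db-slave-2"]

def route(query, healthy):
    """Send writes to the master; send reads to a healthy slave,
    falling back to the master if no slave is available."""
    is_write = query.strip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    if is_write:
        return master
    candidates = [s for s in slaves if s in healthy]
    return random.choice(candidates) if candidates else master

# All slaves down: reads are directed to the master temporarily.
route("SELECT * FROM users", healthy=set())  # -> "db-master"
```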
Why Slaves Have Stale Data
Replication is asynchronous — the master sends write operations to slaves after locally committing the write. If the master crashes before replication completes, promoted slaves lose those updates.
- Master writes at Time 0 → crashes at Time 0.3 → slave still has old data
- Recovery scripts read the master’s binary log to find missing operations and replay them on the slave before promotion
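The crash window can be made concrete with a toy simulation of asynchronous replication (the log lists are illustrative stand-ins for the master's binary log and the slave's relay log):

```python
# Asynchronous replication: the master commits locally first, then ships
# the write to the slave. A crash in between loses the update on the slave.
master_log = []
slave_log = []

def write(value, crash_before_replication=False):
    master_log.append(value)   # commit locally (Time 0)
    if crash_before_replication:
        return                 # master crashes before shipping the write
    slave_log.append(value)    # replication completes shortly after

write("balance=100")
write("balance=200", crash_before_replication=True)
# slave_log still holds only the first write, so a promoted slave is stale
# until a recovery script replays the missing entries from master_log.
```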
Multi-Master Replication
- Multiple masters accept both reads and writes
- Changes on any master replicate to all others
- If one master fails, others continue accepting writes — no promotion needed
- Tradeoff: Write conflicts when the same data is modified on multiple masters simultaneously
- Resource: https://en.wikipedia.org/wiki/Multi-master_replication
- Example:
- 3 masters: A, B, C
- Normal: A writes `balance=100`, B writes `name="Alice"` → all sync
- A crashes: B and C keep running, app routes writes to them
- A recovers: syncs missed writes from B or C
- Conflict: if A writes `balance=200` and B writes `balance=150` at the same time, which wins?
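One common (though lossy) answer to "which wins?" is last-write-wins, where each write carries a timestamp and the latest one survives. A minimal sketch, assuming synchronized clocks (a real assumption conflict-resolution systems have to defend):

```python
# Last-write-wins conflict resolution: each write is (timestamp, value);
# the write with the highest timestamp wins, and the other is discarded.
def resolve(writes):
    """writes: list of (timestamp, value) pairs from different masters."""
    return max(writes, key=lambda w: w[0])[1]

# A writes balance=200 at t=100; B writes balance=150 at t=101.
winner = resolve([(100, 200), (101, 150)])  # -> 150, because B wrote last
```

Note that last-write-wins silently drops A's update; other schemes (e.g. application-level merge) trade simplicity for safety.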
Circular Replication
- Ring topology: A → B → C → A
- Each node is both master and slave
- If one node fails, the ring reconfigures to bypass it
- Tradeoff: Latency compounds around the ring; propagation delays mean data may be inconsistent
- Example:
- Ring A → B → C → A
- A receives `INSERT users(1,"Alice")` → syncs to B → syncs to C
- B crashes: ring reconfigures to A → C → A
- Propagation delay: if each hop = 5ms, a write on A takes ~10ms to reach C
- In a 10-node ring, the last node sees writes ~45ms after they committed on the first node
- Resource: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replication-multi-master.html
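The delay numbers above follow from a simple formula: the node farthest from the writer is (n − 1) hops away, so with a fixed per-hop latency (5 ms here, an assumed constant):

```python
# Worst-case propagation delay in a circular replication ring:
# the last node is (nodes - 1) hops from the writing node.
def last_node_delay_ms(nodes, hop_ms=5):
    return (nodes - 1) * hop_ms

last_node_delay_ms(3)   # A -> B -> C: 10 ms for C to see A's write
last_node_delay_ms(10)  # 10-node ring: 45 ms for the last node
```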
Cache
- A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly.
- Resource: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/
- Read-Through Cache: First check whether the cache has the response available; if yes, return it. If not, query the database, store the response in the cache, and send it back to the client.
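The read path described above (often called cache-aside when the application itself does the lookup) is a few lines of code; the dict cache and the query function below are illustrative stand-ins:

```python
cache = {}

def slow_db_query(key):
    return f"value-for-{key}"   # stand-in for an expensive database read

def get(key):
    if key in cache:            # cache hit: skip the database entirely
        return cache[key]
    value = slow_db_query(key)  # cache miss: query the database
    cache[key] = value          # store the response for later requests
    return value
```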
- Considerations
- Decide when to use cache: When data is read frequently but modified infrequently. Not ideal for persisting data
- Expiration Policy: Not too short otherwise the system will have to reload data from db too frequently. Not too long otherwise the data will become stale.
- Consistency: Involves keeping the data store and the cache in sync. Resource: R. Nishtala et al., "Scaling Memcache at Facebook," 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13)
- Mitigating failures: A single cache server represents a potential single point of failure.
- Eviction Policy: Once the cache is full, any request to add items might cause existing items to be removed. LRU, LFU, FIFO, etc. can be adopted to satisfy different use cases
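LRU is the most common of these policies; a minimal sketch using Python's `OrderedDict` (capacity and keys here are illustrative):

```python
from collections import OrderedDict

# LRU eviction: when the cache is full, remove the least recently
# used entry to make room for the new one.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

c = LRUCache(2)
c.put("a", 1); c.put("b", 2)
c.get("a")     # "a" is now most recently used
c.put("c", 3)  # evicts "b", the least recently used entry
```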
Content Delivery Network (CDN)
- A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, etc.
- Dynamic Content Caching enables the caching of HTML pages based on request path, query strings, cookies, and request headers. Resource: https://aws.amazon.com/cloudfront/dynamic-content/
- Considerations
- Cost: We are charged for data transfers in and out of the CDN. Caching infrequently used assets provides no significant benefits.
- Setting an apt cache expiry: Similar logic to that of cache expiry
- CDN fallback: We should consider how the application copes with CDN failure. Clients should be able to detect a problem and request resources from the origin.
- Invalidating files
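The CDN fallback consideration can be sketched as try-the-CDN-first logic in the client; the hosts and the fetch function below are illustrative stand-ins for real HTTP calls:

```python
CDN_URL = "https://cdn.example.com"       # illustrative hosts
ORIGIN_URL = "https://origin.example.com"

def fetch(base, path):
    """Stand-in for an HTTP GET; raises on failure. Here the CDN is
    hard-coded to fail so the fallback path is exercised."""
    if base == CDN_URL:
        raise ConnectionError("CDN unreachable")
    return f"contents of {base}{path}"

def get_asset(path):
    try:
        return fetch(CDN_URL, path)     # normal path: serve from the CDN
    except ConnectionError:
        return fetch(ORIGIN_URL, path)  # fallback: request from the origin

get_asset("/logo.png")  # -> "contents of https://origin.example.com/logo.png"
```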