From a naive SQL bounding-box query to a sharded, in-memory QuadTree fleet — how to find the 50 best ramen spots within 1 mile of you in 100ms across 500M places worldwide
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 14:02 on a Saturday. Sarah just landed at SFO, took the BART to Union Square, and her stomach is growling. She pulls out her phone, opens Yelp, types "ramen", and within 100 milliseconds her screen fills with 50 highly-rated restaurants — sorted by distance, all within walking range, complete with star ratings, photos, and review snippets. Behind that one tap is a system that just searched through 500 million places worldwide and returned the right 50 — fast enough that Sarah didn't even notice the wait. That's the system we're designing.
Yelp (and apps like Foursquare, Google Maps "Nearby", or Uber Eats restaurant discovery) lets users add places (a new bakery opens — someone registers it), discover nearby places via search (the central read workload), and read & write reviews with ratings, photos, and text. The hard part isn't storing places — a database can do that. The hard part is the geo lookup: given a latitude and longitude and a radius, find every place inside that circle, ranked by relevance, in under 100ms, while 100,000 other Sarahs around the world are doing the same thing simultaneously.
Before drawing a single box, pin down exactly what the system must do. In an interview, asking these questions shows you're not just regurgitating a memorized solution.
Numbers are not optional in HLD — they drive every architectural choice. We'll size the index, the cache, and the shard count based on these.
Assume 500 million places globally (restaurants + bars + shops + gyms + theaters + ...) with 20% growth/year. Assume 100,000 search QPS at peak — a busy Saturday evening across all time zones combined. Reviews are written at maybe 1,000 QPS — a rounding error compared to searches.
- Places: 500M total, growing 20%/year
- Search QPS: 100K/sec peak — read-heavy by far
- Write QPS: ~1K/sec — new places + reviews
- Latency target: under 100ms — real-time UX
Each Place row holds: id (8B), name (256B), latitude (8B), longitude (8B), description (512B), category (1B). Total: roughly ~793 bytes per place.
| Metric | Value | Why it matters |
|---|---|---|
| Total places | 500M | Drives index size and shard count |
| Place storage | ~400 GB | 500M × 793B; fits across modest fleet |
| Index size (lat+long+id) | ~12 GB | 500M × ~24B per leaf entry — fits in RAM! |
| Search QPS | 100K/s | Forces sharding + fan-out + cache |
| Reviews/photos | multi-TB | Stored separately from place index |
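The arithmetic behind the table, as a quick Python sketch (the constants come straight from the row layout above):

```python
# Back-of-envelope sizing for the estimates table above.
PLACES = 500_000_000

# Place row: id + name + lat + lng + description + category
row_bytes = 8 + 256 + 8 + 8 + 512 + 1           # = 793 B
place_storage_gb = PLACES * row_bytes / 1e9      # ≈ 400 GB

# Geo-index leaf entry: (place_id, lat, lng)
leaf_entry_bytes = 8 + 8 + 8                     # = 24 B
index_gb = PLACES * leaf_entry_bytes / 1e9       # ≈ 12 GB — fits in RAM

print(f"place storage ≈ {place_storage_gb:.0f} GB, geo-index ≈ {index_gb:.0f} GB")
```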
Three tables carry the entire data model. We deliberately split Place (small, hot, indexed by location) from Review and Photo (large, cold-ish, indexed by place_id) so the geo-index stays tiny and fast.
Why split Review and Photo from Place? Because Place gets indexed by a custom in-memory geo structure (the QuadTree we'll build in §6), and we don't want every byte of review text bloating that index. A place with 10,000 reviews still occupies a single tiny entry in the geo-index — its reviews and photos live in separate tables that we only read after the user picks a place.
Place is looked up by id millions of times per second, after a geo-lookup hands us a list of place IDs. That's a pure key-value access pattern — Cassandra or DynamoDB shine here. Reviews are similarly accessed by place_id. There are no joins, no transactions; everything is a denormalized lookup.

One endpoint dominates: search. A few smaller endpoints cover adding places and writing reviews. Defining the API contract early locks down what the architecture has to deliver.
```
// The big one — the heart of Yelp
GET /api/v1/search
    ?api_key=...
    &search_terms=ramen                // optional text filter
    &user_location=37.7879,-122.4075
    &radius_filter=1600                // meters (1 mile)
    &max_results=20
    &category_filter=restaurant
    &sort=rating                       // rating | distance | popularity
    &page_token=eyJv...                // cursor for pagination

→ 200 OK
{
  "results": [
    {
      "id": "p_8a3e",
      "name": "Ippudo Ramen",
      "lat": 37.7882,
      "lng": -122.4068,
      "distance_m": 47,
      "rating": 4.6,
      "review_count": 2104,
      "thumbnail": "..."
    },
    ...
  ],
  "next_page_token": "eyJv..."
}

// Add / update / delete a place — write path
POST   /api/v1/places              { name, lat, lng, category, description }
PUT    /api/v1/places/:id          { ...partial updates... }
DELETE /api/v1/places/:id

// Reviews
POST   /api/v1/places/:place_id/reviews   { rating, text, photos[] }
```
One detail of search matters: the user's lat/lng is passed in explicitly (from the device's GPS), not derived from IP. This is what makes geo-search work — the system knows exactly where Sarah is and can build a tight bounding query. The radius_filter defaults to 1 mile / 1600m for "nearby"-style queries; users can widen it.

This is the section that wins or loses the interview. We build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The key insight everything turns on is spatial indexing — partitioning the world geographically instead of treating latitude and longitude as two unrelated columns.
Sketch the simplest system: one MySQL database, one Place table with two indexes — one on latitude, one on longitude. To find places near Sarah, run a bounding-box query: SELECT * FROM places WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?.
It looks innocent. It is not. Three concrete failures emerge the moment traffic shows up:
The query optimizer has to pick: use the lat index, or the lng index, or intersect both. The lat-index alone returns every place at Sarah's latitude — that's a horizontal stripe across the entire planet, millions of rows. Same for the lng-index. Even intersecting two giant result sets is slow — the DB ends up scanning huge intermediate sets to find the few hundred that actually match both filters.
Even if the query were perfect, a single MySQL box answering 100,000 spatial queries per second across 500M rows simply won't fit within one disk's IOPS budget. Cache misses turn into random disk seeks at ~5ms each; at 100K queries/sec, even a few seeks per query means hundreds of thousands of seeks per second, which no spinning drive (and few SSDs) can sustain. The DB CPU pegs at 100% and p99 latency balloons into seconds.
The query BETWEEN lat1 AND lat2 selects a rectangle. Sarah asked for "1 mile radius" — that's a circle. The corners of the bounding rectangle contain places that are further than 1 mile from Sarah. We'd need a second filtering pass with the haversine formula. It's correct, but slow, and it shows the model is wrong: lat+lng are not independent, they describe a single 2D location.
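To make the circle-vs-rectangle point concrete, here is the haversine second pass sketched in Python — the function names are illustrative, and the 6,371 km mean earth radius is the usual approximation:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean earth radius, metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(candidates, lat, lng, radius_m):
    """Trim the bounding-box survivors down to the true search circle."""
    return [p for p in candidates if haversine_m(p["lat"], p["lng"], lat, lng) <= radius_m]
```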
The single most important insight in this design is to stop treating latitude and longitude as two independent columns and start treating them as a single 2D space — and then partition that space into tiles. A search becomes "which tile is Sarah in? give me the places in that tile and its neighbors." Two evolutions of this idea, in order of sophistication:
Divide the planet into a fixed N×N grid of equal-size cells. Each place is filed into the one cell that contains its (lat, lng). To search, find Sarah's cell and look at its places.
Why it fails: places are extremely non-uniform. Manhattan has thousands of restaurants per square kilometer; the Sahara has zero per million square kilometers. Fixed-size cells either drown in places where it's dense, or are mostly empty where it's not. We need cells that adapt to density.
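For concreteness, filing places into a fixed grid is a one-liner — the 0.01° cell size here is an arbitrary pick, which is exactly the parameter the next approach makes adaptive:

```python
CELL_DEG = 0.01  # fixed cell size in degrees (~1 km of latitude) — arbitrary, and that's the problem

def fixed_grid_cell(lat: float, lng: float) -> tuple[int, int]:
    """Map a (lat, lng) to the fixed grid cell that contains it."""
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

# Every place in Manhattan lands in a handful of overflowing cells,
# while most cells covering the ocean or the Sahara stay empty.
```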
Start with one cell covering the world. When a cell exceeds a threshold (say, 500 places), split it into 4 equal sub-cells (NW, NE, SW, SE). Recursively. A leaf cell with fewer than 500 places stays as-is.
Why it wins: every leaf has roughly the same number of places — Manhattan ends up with thousands of tiny leaves, the Sahara with one giant leaf. Search is fast everywhere because the work-per-leaf is bounded. It's a recursive map-zoom, materialized as a tree.
Think of a QuadTree as a recursive map zoom: when you zoom out you see one giant tile (one node); when you zoom into Manhattan, that tile splits into 4, and each of those splits into 4 again, until each tile is small enough to read at a glance. The deeper the zoom, the more places you cram per tile. The QuadTree codifies that intuition into a data structure we can query in O(log N) time.
Now the full picture. The QuadTree alone is the index — but we also need an API tier, a fan-out aggregator (because we'll shard the QuadTree across many servers), the source-of-truth Place DB, a cache, a review/photo service, and a ranking service. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Sarah's iPhone, the Yelp web app, an embedded "places near me" widget. The client knows the user's GPS coordinates (from the OS) and packages them into a GET /api/v1/search request. From the client's view, the entire system is one URL — it sends location + filters and expects a ranked list of nearby places back in 100ms.
Solves: nothing on its own — but every design choice flows backward from "what does the client experience?" Latency, the JSON shape, and the radius semantics are all client-facing concerns.
The traffic cop. Sits in front of the Search API tier, distributes incoming HTTPS, and yanks unhealthy backends out of rotation via health checks. Round-robin is fine to start; switch to least-connections once you have enough nodes that load skew matters. Terminates TLS so backend pods don't pay the crypto cost.
Solves: single-server bottleneck and single-server failure. Without an LB, one API crash takes down the whole service. With it, we lose 1/N of capacity for a few seconds until health checks fail the bad node out.
Stateless service that owns the public /api/v1/search contract. It validates the user's input (lat/lng look real, radius is bounded, api_key is valid), hands the structured query down to the Aggregator, and shapes the response back into the JSON the client expects. Scales horizontally — just add more pods behind the LB.
Solves: a clean public-facing surface that's decoupled from how the index is laid out internally. We could swap the QuadTree for Geohash tomorrow and the API contract wouldn't change.
The brain of the read path. Given a (lat, lng, radius), it figures out which QuadTree shards could possibly contain matching places (because the QuadTree is sharded across many servers), fans out the query to those shards in parallel, collects each shard's top-K candidates, and merges them into a single ranked list. Think of it as the host at a buffet — it doesn't cook anything, it just runs to every station, grabs a plate, and presents the combined tray.
Solves: the fan-out problem. The QuadTree is too big to live on one server — it's sharded by place_id across, say, 30 shards. A geo-query may touch 1, 2, or N of them depending on how the search circle straddles shard boundaries. Without an aggregator, the Search API would have to know the shard topology and run the fan-out itself — bloating its responsibility and making sharding changes painful.
The actual spatial index. Each server holds a partition of the global QuadTree fully in RAM — every leaf node, every internal node, every place_id stored under each leaf. Replies to a query like "give me the top-K places within this bounding box that match these filters" with an in-memory tree walk that takes single-digit milliseconds. Replicated 2× per shard for fault tolerance.
Solves: the actual geo-search problem. This is where Pass 2's mental model becomes code. Without it, every search would fall back to a SQL bounding-box scan over 500M rows — back to Pass 1's 100ms-becomes-1000ms disaster. The QuadTree's recursive subdivision is what makes "find 50 places near here" cost 5ms instead of 500ms.
The source of truth — a partitioned, replicated NoSQL key-value store holding all 500M place records (id, name, lat, lng, description, category). Sharded by place_id via consistent hashing, replicated 3× across availability zones. Critically, the QuadTree shards do not own the place data; they own only an index pointing at place_ids. If a QuadTree server crashes, we rebuild it from the Place DB.
Solves: durable, scalable, query-by-id storage. Cassandra spreads 400 GB of place data across dozens of nodes and handles failures automatically. The QuadTree gets to be a fast in-memory index without also being a database.
A small lookup service that maintains the reverse mapping placeId → quadtreeShardId. When a QuadTree server dies and we need to rebuild it from the Place DB, we don't want to re-shuffle every one of 500M places asking "do you belong here?" — instead we ask the index server "give me every placeId whose home shard was 7" and stream just those rows from Place DB.
Solves: fast crash recovery. Without it, rebuilding one failed QuadTree shard would require scanning the entire 500M-row Place DB to find which 16M places belong to it. With it, recovery is a targeted reload of just those 16M rows — minutes, not hours.
Two layers of caching. (a) Hot place objects — when a search returns 20 place_ids, we need to look up the full place object for each. The hottest few million places (popular tourist spots, viral restaurants) live in Memcached, so most lookups never hit Place DB. (b) Hot search results — common queries like "ramen + downtown San Francisco" repeat thousands of times per minute; we cache the top-20 results for the canonical version of these queries, keyed by (city, category, radius_bucket).
Solves: read amplification. Each search returns ~20 place_ids, so 100K searches/sec means 2M place lookups/sec — that would melt Cassandra. Memcached absorbs 95%+ of those lookups, leaving Cassandra sized for the cold tail.
A separate microservice owning reviews and user-uploaded photos, with photos stored in a blob store (S3) and review text in its own table keyed by place_id. Critically, this is on the detail-page path, not the search path — when Sarah opens Yelp she only sees thumbnails (one cached photo URL per place). Only when she taps a specific place does the client fetch full reviews and photos.
Solves: keeping the search path light. If we co-located review text and photo blobs with the place index, every search response would balloon to megabytes. Splitting them lets the search response stay tiny (a few KB for 20 places) and lets photo storage scale on its own clock — multi-TB without blowing up our hot index.
Once the QuadTree shards return a set of candidate place_ids, those candidates need to be ranked. Score = f(rating, distance, popularity, personalization, sponsored boost). The Ranking Service computes a score per candidate and returns the top-K. Each shard pre-ranks its own candidates and returns its local top-K (say, 100); the Aggregator then merges those K-per-shard into a final top-20.
Solves: turning "places that are nearby" into "places Sarah probably wants". Two restaurants 100m from Sarah, one with 4.8 stars and 5,000 reviews, one with 3.1 stars and 12 reviews — proximity alone would tie them. Ranking surfaces the right one first. Personalization (Sarah is vegetarian; she rates highly-vegan places higher) goes here too.
Sarah's exact GPS is (37.7879, -122.4075). She wants 20 ramen spots within 1600m, sorted by rating. Trace it through the numbered components:
1. The client sends GET /api/v1/search?lat=37.7879&lng=-122.4075&radius=1600&category=ramen&sort=rating.
2. The load balancer routes it to a Search API pod, which validates the input and checks the search-result cache under the key {ramen, San_Francisco_grid_cell, 1mi_bucket}. Cache hit on a hot query → return the result, done in 15ms. We'll assume a miss for the rest.
3. The Aggregator fans the query out to every QuadTree shard in parallel; each shard walks its in-memory tree and returns its local top-K candidates.
4. The Aggregator merges the candidates, the Ranking Service scores them, and the winning 20 place_ids are hydrated into full place objects from Memcached (Place DB on misses) and returned to Sarah.
5. Only when Sarah taps a specific place does the client fetch its full detail page via GET /api/v1/places/p_8a3e/reviews.

The QuadTree is the heart of this system. Time to crack it open and understand exactly how a tree of geographic rectangles becomes a sub-millisecond proximity index.
Every node in a QuadTree represents a rectangular region of the planet. The root represents the entire world. Each internal node has up to 4 children — its rectangle split into Northwest, Northeast, Southwest, Southeast quadrants. A leaf node holds an actual list of place_ids (and their lat/lng/category for fast filtering) — but only up to a threshold. We pick 500 places per leaf as the sweet spot: small enough to scan in microseconds, large enough that the tree doesn't get pathologically deep.
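A minimal sketch of that structure in Python — the names and the 500-place threshold mirror the description above; everything else (field layout, method names) is illustrative, not Yelp's actual code:

```python
from dataclasses import dataclass, field

LEAF_CAPACITY = 500  # max places per leaf before it splits into 4 quadrants

@dataclass
class Rect:
    """Axis-aligned lat/lng rectangle — the region of the planet a node covers."""
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float

    def contains(self, lat: float, lng: float) -> bool:
        return self.min_lat <= lat < self.max_lat and self.min_lng <= lng < self.max_lng

    def quadrants(self) -> list["Rect"]:
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lng = (self.min_lng + self.max_lng) / 2
        return [Rect(self.min_lat, self.min_lng, mid_lat, mid_lng),   # SW
                Rect(self.min_lat, mid_lng, mid_lat, self.max_lng),   # SE
                Rect(mid_lat, self.min_lng, self.max_lat, mid_lng),   # NW
                Rect(mid_lat, mid_lng, self.max_lat, self.max_lng)]   # NE

@dataclass
class QuadNode:
    """One map tile: a leaf holding up to LEAF_CAPACITY places, or 4 children."""
    rect: Rect
    places: list = field(default_factory=list)    # (place_id, lat, lng) tuples
    children: list = field(default_factory=list)  # empty list ⇒ this node is a leaf

    def is_leaf(self) -> bool:
        return not self.children
```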
Given Sarah's (lat, lng) and radius, find the nearby places:
```
function searchNearby(node, lat, lng, radius, results):
    if node.rectangle does not intersect circle(lat, lng, radius):
        return                        // prune entire subtree
    if node.isLeaf():
        for place in node.places:     // at most 500
            if distance(place, lat, lng) <= radius:
                results.add(place)
        return
    for child in node.children:       // up to 4 quadrants
        searchNearby(child, lat, lng, radius, results)
```
The magic is the prune step. If a node's rectangle doesn't even touch Sarah's search circle, we skip the entire subtree — possibly millions of places eliminated in one pointer-comparison. This is why the QuadTree gives us O(log N) instead of O(N).
One subtle but important optimization: every leaf node maintains pointers to its geographically adjacent leaves (a doubly-linked list across the leaves). When Sarah's 1-mile radius spills out of her current leaf into a neighbor, we don't have to walk back up to the root and down again — we just follow the leaf pointer sideways. This turns "walk neighbors" from O(log N) per neighbor into O(1).
How big is this tree? Each leaf entry is (place_id 8B, lat 8B, lng 8B) = 24B; 500M × 24B ≈ 12 GB of leaf data. Internal nodes with 4 child pointers each are negligible (~10 MB). Total ~12 GB — it fits in the RAM of one fat server, but we'll partition for fault tolerance and to spread the 100K QPS load.

A new bakery opens in San Francisco. How does it enter the QuadTree?
Walk the QuadTree from the root: at each internal node, pick the child quadrant whose rectangle contains the new place's (lat, lng). Continue until you hit a leaf. Append the place to that leaf's list. If the leaf now exceeds 500 places, split it: create 4 new sub-quadrants, redistribute the 501 places among them by location, and turn the former leaf into an internal node pointing at the 4 new leaves.
The insert is O(log N) — at most 15 hops down a tree of ~1M leaves.
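The insert-and-split, continuing the QuadNode sketch above (hypothetical code, not Yelp's):

```python
def insert(node: QuadNode, place_id: int, lat: float, lng: float) -> None:
    # Walk down: at each internal node, follow the child quadrant containing the point.
    while not node.is_leaf():
        node = next(c for c in node.children if c.rect.contains(lat, lng))

    node.places.append((place_id, lat, lng))

    # Leaf overflow: split into 4 sub-quadrants and redistribute the 501 places.
    # (A production version would keep splitting if a sub-quadrant still overflows.)
    if len(node.places) > LEAF_CAPACITY:
        node.children = [QuadNode(rect=q) for q in node.rect.quadrants()]
        for pid, plat, plng in node.places:
            child = next(c for c in node.children if c.rect.contains(plat, plng))
            child.places.append((pid, plat, plng))
        node.places = []  # the former leaf is now an internal node
```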
Updating a place's name or category is just a Place DB write — the QuadTree doesn't care. Updating a place's location (rare) means: remove from old leaf, walk down again, insert into new leaf. Deleting is: remove from leaf; if the leaf falls under some lower threshold (say, 100 places) and its 3 sibling leaves are also sparse, merge them back into a single leaf to keep the tree balanced.
The QuadTree is ~12 GB and must serve 100K QPS — a single fat server could arguably hold the index and maybe even the load, but we'd have no fault tolerance and no headroom for growth. We must shard. Two natural choices:
"Shard 1 owns North America, shard 2 owns Europe, shard 3 owns Asia..." Intuitively appealing — a search in San Francisco only ever hits the North America shard.
Why it fails: the hot region problem. Manhattan has more places (and more searches) than the entire state of Wyoming. The Manhattan shard melts under load while the Wyoming shard sits idle. Adding more shards to North America just relocates the problem — geography is fundamentally non-uniform.
Compute shard = hash(place_id) mod N — or, better, place the hash on a consistent-hashing ring so adding a shard moves only ~1/N of the keys. Distribution is uniform by construction. Each shard has its own QuadTree containing roughly 1/N of all places — globally distributed, not regionally clustered.
Trade-off: a search at Sarah's location must fan out to all shards in parallel, because the places near her are spread uniformly across every shard. The aggregator (④) merges results. This is the cost of uniformity, and it's why the aggregator exists.
Critical detail: each shard owns its own complete QuadTree covering the entire planet — but populated with only its 1/N slice of places. Shard 7's QuadTree has the same root-to-leaf structure as shard 12's, just with different place_ids in the leaves. A geo-search hits all N shards in parallel, each returns its local top-K candidates, the aggregator picks the global top-K from the N×K candidate pool.
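A sketch of that scatter-gather in Python — query_shard is a hypothetical per-shard RPC returning (score, place_id) pairs; the thread-pool fan-out and heap merge are one reasonable way to realize the description above:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(shards, lat, lng, radius_m, k=20, per_shard_k=100):
    """Fan out to every QuadTree shard in parallel, merge the N×K candidates."""
    def query(shard):
        # Hypothetical RPC: returns [(score, place_id), ...] — the shard's local top-K.
        return shard.query_shard(lat, lng, radius_m, limit=per_shard_k)

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard_results = list(pool.map(query, shards))

    # Global top-K out of the N × per_shard_k candidate pool.
    candidates = [c for results in per_shard_results for c in results]
    return heapq.nlargest(k, candidates)  # ranks by score (first tuple element)
```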
QuadTrees live in RAM. RAM is volatile. A QuadTree server crash means we just lost an in-memory index — we need to recover it fast.
Every QuadTree shard runs on at least 2 servers (a primary and a hot replica). They both serve reads — doubling per-shard read capacity. Both are kept in sync by listening to the same Place DB change stream (Cassandra CDC, or a Kafka topic of place mutations). When a place is added/updated/deleted, both replicas of the responsible shard apply the change.
If the primary dies, the replica takes over. Latency blip of a few seconds for failover, no data loss because the index is rebuildable from Place DB.
A full failure of all replicas of a shard means we need to rebuild that shard from scratch. This is where the QuadTree Index Server ⑦ earns its keep. Instead of asking Place DB "show me every one of 500M places, let me check if it belongs to shard 7" (a full table scan), we ask the index server "give me every placeId whose home shard is 7" — it returns the ~16M relevant IDs, we stream just those rows from Place DB, walk the QuadTree insert algorithm 16M times, and the shard is back online in minutes, not hours.
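Recovery, sketched against hypothetical index-server and Place DB clients (the method names are assumptions), reusing the insert routine from the QuadTree deep-dive:

```python
def rebuild_shard(shard_id, index_server, place_db, world_rect) -> QuadNode:
    """Rebuild one QuadTree shard in memory after a total replica loss."""
    root = QuadNode(rect=world_rect)

    # Targeted reload: only the ~16M place_ids homed on this shard,
    # not a scan over all 500M rows.
    place_ids = index_server.place_ids_for_shard(shard_id)   # hypothetical API
    for batch in place_db.stream_places(place_ids):          # hypothetical API
        for place in batch:
            insert(root, place["id"], place["lat"], place["lng"])
    return root
```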
Caching in Yelp comes in two flavors, both important.
The QuadTree returns 20 place_ids; the API needs the full place objects (name, description, photo URL, current avg rating). Looking up 20 objects per search × 100K searches/sec = 2M lookups/sec on Place DB — too much. Memcached caches the hot ~5 million place objects (think: every popular tourist destination in every major city) with LRU eviction. ~95%+ hit rate; Place DB sees only the cold tail.
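A read-through sketch for that hydration step — the cache and Place DB clients and their method names are hypothetical stand-ins:

```python
def hydrate_places(place_ids, cache, place_db):
    """Batch-fetch full place objects: Memcached first, Place DB for the misses."""
    cached = cache.get_many([f"place:{pid}" for pid in place_ids])   # hypothetical client
    places, missing = {}, []
    for pid in place_ids:
        obj = cached.get(f"place:{pid}")
        if obj is not None:
            places[pid] = obj
        else:
            missing.append(pid)

    if missing:
        fresh = place_db.multi_get(missing)                          # hypothetical client
        for pid, obj in fresh.items():
            places[pid] = obj
            cache.set(f"place:{pid}", obj, expire=3600)              # hot set, LRU-evicted
    return [places[pid] for pid in place_ids if pid in places]
```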
Many searches are repetitive: "ramen + downtown SF + 1 mile" runs thousands of times per minute from different users at roughly the same location. We bucket the user's location into a coarse grid cell (e.g., 200m squares) and cache (grid_cell, category, radius_bucket) → top-20 place_ids for 60 seconds. A massive amount of read traffic short-circuits before ever touching the QuadTree fleet.
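And the bucketing that makes nearby users collide on the same cache key — the 200m cell and 60s TTL follow the text above; the key shape and the 400m radius buckets are illustrative:

```python
CELL_M = 200             # coarse grid cell ≈ 200m on a side
M_PER_DEG_LAT = 111_320  # metres per degree of latitude (approx.)

def result_cache_key(lat, lng, category, radius_m):
    """Nearby users asking the same question hit the same key."""
    cell_deg = CELL_M / M_PER_DEG_LAT
    grid_cell = (int(lat // cell_deg), int(lng // cell_deg))
    radius_bucket = round(radius_m / 400) * 400   # snap radius to coarse buckets
    return f"search:{grid_cell[0]}:{grid_cell[1]}:{category}:{radius_bucket}"

# Cached for 60 seconds: key → top-20 place_ids
```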
Two LB layers in this system, each playing a different role.
Public-facing LB (AWS ALB / nginx). Distributes incoming HTTPS across Search API pods. Health-checks every 5 seconds, terminates TLS, evicts unhealthy pods. Round-robin works fine because every API pod is stateless and identical.
Internal fan-out LB. Per-shard, the aggregator must pick one of the 2+ replicas to query. Round-robin at first; load-aware (least connections, or per-replica latency tracking) once you see one replica running consistently hotter than its peer due to noisy-neighbor effects on its host.
The QuadTree finds nearby places. Ranking decides which nearby places to show first. A good ranking turns "we found 200 ramen places in your radius" into "here are the 20 you should actually look at".
A simple linear blend works surprisingly well as a baseline:
```
score(place, user) =
    w1 * normalized_rating     // 4.6 stars → 0.92
  + w2 * proximity_score       // inverse distance, capped
  + w3 * popularity            // log(review_count)
  + w4 * personalization       // user's preferences, history
  + w5 * sponsored_boost       // paid placement, capped
```
Each weight wi is tuned via offline A/B and online experiments. Rating + popularity dominate the head of the distribution; personalization decides ties at the head and pushes specific places up the long tail.
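One possible baseline implementation of the blend — the normalizations follow the comments above (rating/5, log review count, capped inverse distance), while the weights and the affinity_for personalization hook are placeholder assumptions:

```python
import math

WEIGHTS = dict(rating=0.4, proximity=0.25, popularity=0.2, personal=0.1, sponsored=0.05)

def score(place, user, radius_m):
    normalized_rating = place["rating"] / 5.0                        # 4.6 stars → 0.92
    proximity = max(0.0, 1.0 - place["distance_m"] / radius_m)       # inverse distance, capped
    popularity = math.log1p(place["review_count"]) / math.log1p(10_000)  # log-scaled review count
    personal = user.affinity_for(place)                              # hypothetical: 0..1 from history
    sponsored = min(place.get("boost", 0.0), 0.2)                    # paid placement, capped

    return (WEIGHTS["rating"] * normalized_rating
            + WEIGHTS["proximity"] * proximity
            + WEIGHTS["popularity"] * popularity
            + WEIGHTS["personal"] * personal
            + WEIGHTS["sponsored"] * sponsored)
```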
Each QuadTree shard pre-ranks its candidates locally — it may find thousands of places in the radius, but scores them and returns only its top-100 to the Aggregator. The Aggregator does final ranking across the N×100 candidates from N shards and picks the global top-20. This two-stage approach keeps shard-to-aggregator network traffic small (we send 100 ranked candidates per shard, not 10,000 unranked ones) without sacrificing global ranking quality.
One contrast worth naming: a live-location workload such as ride-hailing driver positions is write-heavy — every driver reports a new position every few seconds — and an in-memory QuadTree would churn constantly on re-inserts. A better fit there is a plain key-value mapping of geohash_prefix → set-of-driver-ids, with TTL so stale locations auto-expire. Lookups become "fetch the driver-id sets for my geohash + neighbors". This loses density-adaptation but gains write-friendliness — exactly the right trade for that workload.

Finally, how do new reviews reach the ranking? An asynchronous aggregation job recomputes (avg_rating, review_count) per place and updates the denormalized fields on the Place row in Place DB. The QuadTree itself doesn't store rating — it stores place_id; the Aggregator looks up the current rating from cache/Place DB at query time. So new reviews are reflected within seconds on cache misses, and within 60s for cache hits (TTL). Heavier-weight personalization features (user history embeddings) can be recomputed nightly without breaking the user experience.