From a naive SQL bounding-box query to a sharded, in-memory QuadTree fleet — how to find the 50 best ramen spots within 1 mile of you in 100ms across 500M places worldwide
This deep-dive applies the 4-step HLD interview framework. As you read, map each section to Requirements → Entities → APIs → High-Level Design → Deep Dives, and notice which of the 8 common patterns and key technologies are at play.
It's 14:02 on a Saturday. Sarah just landed at SFO, took the BART to Union Square, and her stomach is growling. She pulls out her phone, opens Yelp, types "ramen", and within 100 milliseconds her screen fills with 50 highly-rated restaurants — sorted by distance, all within walking range, complete with star ratings, photos, and review snippets. Behind that one tap is a system that just searched through 500 million places worldwide and returned the right 50 — fast enough that Sarah didn't even notice the wait. That's the system we're designing.
Yelp (and apps like Foursquare, Google Maps "Nearby", or Uber Eats restaurant discovery) lets users add places (a new bakery opens — someone registers it), discover nearby places via search (the central read workload), and read & write reviews with ratings, photos, and text. The hard part isn't storing places — a database can do that. The hard part is the geo lookup: given a latitude and longitude and a radius, find every place inside that circle, ranked by relevance, in under 100ms, while 100,000 other Sarahs around the world are doing the same thing simultaneously.
Before drawing a single box, pin down exactly what the system must do. In an interview, asking these questions shows you're not just regurgitating a memorized solution.
Numbers are not optional in HLD — they drive every architectural choice. We'll size the index, the cache, and the shard count based on these.
Assume 500 million places globally (restaurants + bars + shops + gyms + theaters + ...) with 20% growth/year. Assume 100,000 search QPS at peak — a busy Saturday evening across all time zones combined. Reviews are written at maybe 1,000 QPS — a rounding error compared to searches.
- Places: 500M total, growing 20%/year
- Search QPS: 100K/sec peak — read-heavy by far
- Write QPS: ~1K/sec — new places + reviews
- Latency target: under 100ms — real-time UX
Each Place row holds: id (8B), name (256B), latitude (8B), longitude (8B), description (512B), category (1B). Total: roughly ~793 bytes per place.
| Metric | Value | Why it matters |
|---|---|---|
| Total places | 500M | Drives index size and shard count |
| Place storage | ~400 GB | 500M × 793B; fits across modest fleet |
| Index size (lat+long+id) | ~12 GB | 500M × ~24B per leaf entry — fits in RAM! |
| Search QPS | 100K/s | Forces sharding + fan-out + cache |
| Reviews/photos | multi-TB | Stored separately from place index |
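The arithmetic behind the table, as a quick Python sketch (the constants come straight from the row layout above):

```python
# Back-of-envelope sizing for the estimates table above.
PLACES = 500_000_000

# Place row: id + name + lat + lng + description + category
row_bytes = 8 + 256 + 8 + 8 + 512 + 1           # = 793 B
place_storage_gb = PLACES * row_bytes / 1e9      # ≈ 400 GB

# Geo-index leaf entry: (place_id, lat, lng)
leaf_entry_bytes = 8 + 8 + 8                     # = 24 B
index_gb = PLACES * leaf_entry_bytes / 1e9       # ≈ 12 GB — fits in RAM

print(f"place storage ≈ {place_storage_gb:.0f} GB, geo-index ≈ {index_gb:.0f} GB")
```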
Three tables carry the entire data model. We deliberately split Place (small, hot, indexed by location) from Review and Photo (large, cold-ish, indexed by place_id) so the geo-index stays tiny and fast.
Why split Review and Photo from Place? Because Place gets indexed by a custom in-memory geo structure (the QuadTree we'll build in §6), and we don't want every byte of review text bloating that index. A place with 10,000 reviews still occupies a single tiny entry in the geo-index — its reviews and photos live in separate tables that we only read after the user picks a place.
Place is looked up by id millions of times per second, after a geo-lookup hands us a list of place IDs. That's a pure key-value access pattern — Cassandra or DynamoDB shine here. Reviews are similarly accessed by place_id. There are no joins, no transactions; everything is a denormalized lookup.

One endpoint dominates: search. A few smaller endpoints cover adding places and writing reviews. Defining the API contract early locks down what the architecture has to deliver.
```
// The big one — the heart of Yelp
GET /api/v1/search
    ?api_key=...
    &search_terms=ramen                // optional text filter
    &user_location=37.7879,-122.4075
    &radius_filter=1600                // meters (1 mile)
    &max_results=20
    &category_filter=restaurant
    &sort=rating                       // rating | distance | popularity
    &page_token=eyJv...                // cursor for pagination

→ 200 OK
{
  "results": [
    {
      "id": "p_8a3e",
      "name": "Ippudo Ramen",
      "lat": 37.7882,
      "lng": -122.4068,
      "distance_m": 47,
      "rating": 4.6,
      "review_count": 2104,
      "thumbnail": "..."
    },
    ...
  ],
  "next_page_token": "eyJv..."
}

// Add / update / delete a place — write path
POST   /api/v1/places              { name, lat, lng, category, description }
PUT    /api/v1/places/:id          { ...partial updates... }
DELETE /api/v1/places/:id

// Reviews
POST   /api/v1/places/:place_id/reviews   { rating, text, photos[] }
```
One detail of search matters: the user's lat/lng is passed in explicitly (from the device's GPS), not derived from IP. This is what makes geo-search work — the system knows exactly where Sarah is and can build a tight bounding query. The radius_filter defaults to 1 mile / 1600m for "nearby"-style queries; users can widen it.

This is the section that wins or loses the interview. We build the architecture in three passes: the simplest thing that could plausibly work, why it falls apart, and the production shape where every box justifies itself. The key insight everything turns on is spatial indexing — partitioning the world geographically instead of treating latitude and longitude as two unrelated columns.
Sketch the simplest system: one MySQL database, one Place table with two indexes — one on latitude, one on longitude. To find places near Sarah, run a bounding-box query: SELECT * FROM places WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?.
It looks innocent. It is not. Three concrete failures emerge the moment traffic shows up:
The query optimizer has to pick: use the lat index, or the lng index, or intersect both. The lat-index alone returns every place at Sarah's latitude — that's a horizontal stripe across the entire planet, millions of rows. Same for the lng-index. Even intersecting two giant result sets is slow — the DB ends up scanning huge intermediate sets to find the few hundred that actually match both filters.
Even if the query were perfect, a single MySQL box answering 100,000 spatial queries per second across 500M rows simply won't fit within one disk's IOPS budget. Cache misses turn into random disk seeks at ~5ms each; at 100K queries/sec, even a few seeks per query means hundreds of thousands of seeks per second, which no spinning drive (and few SSDs) can sustain. The DB CPU pegs at 100% and p99 latency balloons into seconds.
The query BETWEEN lat1 AND lat2 selects a rectangle. Sarah asked for "1 mile radius" — that's a circle. The corners of the bounding rectangle contain places that are further than 1 mile from Sarah. We'd need a second filtering pass with the haversine formula. It's correct, but slow, and it shows the model is wrong: lat+lng are not independent, they describe a single 2D location.
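To make the circle-vs-rectangle point concrete, here is the haversine second pass sketched in Python — the function names are illustrative, and the 6,371 km mean earth radius is the usual approximation:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean earth radius, metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(candidates, lat, lng, radius_m):
    """Trim the bounding-box survivors down to the true search circle."""
    return [p for p in candidates if haversine_m(p["lat"], p["lng"], lat, lng) <= radius_m]
```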
The single most important insight in this design is to stop treating latitude and longitude as two independent columns and start treating them as a single 2D space — and then partition that space into tiles. A search becomes "which tile is Sarah in? give me the places in that tile and its neighbors." Two evolutions of this idea, in order of sophistication:
Divide the planet into a fixed N×N grid of equal-size cells. Each place is filed into the one cell that contains its (lat, lng). To search, find Sarah's cell and look at its places.
Why it fails: places are extremely non-uniform. Manhattan has thousands of restaurants per square kilometer; the Sahara has zero per million square kilometers. Fixed-size cells either drown in places where it's dense, or are mostly empty where it's not. We need cells that adapt to density.
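For concreteness, filing places into a fixed grid is a one-liner — the 0.01° cell size here is an arbitrary pick, which is exactly the parameter the next approach makes adaptive:

```python
CELL_DEG = 0.01  # fixed cell size in degrees (~1 km of latitude) — arbitrary, and that's the problem

def fixed_grid_cell(lat: float, lng: float) -> tuple[int, int]:
    """Map a (lat, lng) to the fixed grid cell that contains it."""
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

# Every place in Manhattan lands in a handful of overflowing cells,
# while most cells covering the ocean or the Sahara stay empty.
```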
Start with one cell covering the world. When a cell exceeds a threshold (say, 500 places), split it into 4 equal sub-cells (NW, NE, SW, SE). Recursively. A leaf cell with fewer than 500 places stays as-is.
Why it wins: every leaf has roughly the same number of places — Manhattan ends up with thousands of tiny leaves, the Sahara with one giant leaf. Search is fast everywhere because the work-per-leaf is bounded. It's a recursive map-zoom, materialized as a tree.
Think of a QuadTree as a recursive map zoom: when you zoom out you see one giant tile (one node); when you zoom into Manhattan, that tile splits into 4, and each of those splits into 4 again, until each tile is small enough to read at a glance. The deeper the zoom, the more places you cram per tile. The QuadTree codifies that intuition into a data structure we can query in O(log N) time.
Now the full picture. The QuadTree alone is the index — but we also need an API tier, a fan-out aggregator (because we'll shard the QuadTree across many servers), the source-of-truth Place DB, a cache, a review/photo service, and a ranking service. Every node is numbered — find its matching card below to see what it does and what would break without it.
Use the numbers in the diagram above to find the matching card. Each one answers what is this, why is it here, and what would break without it.
Sarah's iPhone, the Yelp web app, an embedded "places near me" widget. The client knows the user's GPS coordinates (from the OS) and packages them into a GET /api/v1/search request. From the client's view, the entire system is one URL — it sends location + filters and expects a ranked list of nearby places back in 100ms.
Solves: nothing on its own — but every design choice flows backward from "what does the client experience?" Latency, the JSON shape, and the radius semantics are all client-facing concerns.
The traffic cop. Sits in front of the Search API tier, distributes incoming HTTPS, and yanks unhealthy backends out of rotation via health checks. Round-robin is fine to start; switch to least-connections once you have enough nodes that load skew matters. Terminates TLS so backend pods don't pay the crypto cost.
Solves: single-server bottleneck and single-server failure. Without an LB, one API crash takes down the whole service. With it, we lose 1/N of capacity for a few seconds until health checks fail the bad node out.
Stateless service that owns the public /api/v1/search contract. It validates the user's input (lat/lng look real, radius is bounded, api_key is valid), hands the structured query down to the Aggregator, and shapes the response back into the JSON the client expects. Scales horizontally — just add more pods behind the LB.
Solves: a clean public-facing surface that's decoupled from how the index is laid out internally. We could swap the QuadTree for Geohash tomorrow and the API contract wouldn't change.
The brain of the read path. Given a (lat, lng, radius), it figures out which QuadTree shards could possibly contain matching places (because the QuadTree is sharded across many servers), fans out the query to those shards in parallel, collects each shard's top-K candidates, and merges them into a single ranked list. Think of it as the host at a buffet — it doesn't cook anything, it just runs to every station, grabs a plate, and presents the combined tray.
Solves: the fan-out problem. The QuadTree is too big to live on one server — it's sharded by place_id across, say, 30 shards. A geo-query may touch 1, 2, or N of them depending on how the search circle straddles shard boundaries. Without an aggregator, the Search API would have to know the shard topology and run the fan-out itself — bloating its responsibility and making sharding changes painful.
The actual spatial index. Each server holds a partition of the global QuadTree fully in RAM — every leaf node, every internal node, every place_id stored under each leaf. Replies to a query like "give me the top-K places within this bounding box that match these filters" with an in-memory tree walk that takes single-digit milliseconds. Replicated 2× per shard for fault tolerance.
Solves: the actual geo-search problem. This is where Pass 2's mental model becomes code. Without it, every search would fall back to a SQL bounding-box scan over 500M rows — back to Pass 1's 100ms-becomes-1000ms disaster. The QuadTree's recursive subdivision is what makes "find 50 places near here" cost 5ms instead of 500ms.
The source of truth — a partitioned, replicated NoSQL key-value store holding all 500M place records (id, name, lat, lng, description, category). Sharded by place_id via consistent hashing, replicated 3× across availability zones. Critically, the QuadTree shards do not own the place data; they own only an index pointing at place_ids. If a QuadTree server crashes, we rebuild it from the Place DB.
Solves: durable, scalable, query-by-id storage. Cassandra spreads 400 GB of place data across dozens of nodes and handles failures automatically. The QuadTree gets to be a fast in-memory index without also being a database.
A small lookup service that maintains the reverse mapping placeId → quadtreeShardId. When a QuadTree server dies and we need to rebuild it from the Place DB, we don't want to re-shuffle every one of 500M places asking "do you belong here?" — instead we ask the index server "give me every placeId whose home shard was 7" and stream just those rows from Place DB.
Solves: fast crash recovery. Without it, rebuilding one failed QuadTree shard would require scanning the entire 500M-row Place DB to find which 16M places belong to it. With it, recovery is a targeted reload of just those 16M rows — minutes, not hours.
Two layers of caching. (a) Hot place objects — when a search returns 20 place_ids, we need to look up the full place object for each. The hottest few million places (popular tourist spots, viral restaurants) live in Memcached, so most lookups never hit Place DB. (b) Hot search results — common queries like "ramen + downtown San Francisco" repeat thousands of times per minute; we cache the top-20 results for the canonical version of these queries, keyed by (city, category, radius_bucket).
Solves: read amplification. Each search returns ~20 place_ids, so 100K searches/sec means 2M place lookups/sec — that would melt Cassandra. Memcached absorbs 95%+ of those lookups, leaving Cassandra sized for the cold tail.
A separate microservice owning reviews and user-uploaded photos, with photos stored in a blob store (S3) and review text in its own table keyed by place_id. Critically, this is on the detail-page path, not the search path — when Sarah opens Yelp she only sees thumbnails (one cached photo URL per place). Only when she taps a specific place does the client fetch full reviews and photos.
Solves: keeping the search path light. If we co-located review text and photo blobs with the place index, every search response would balloon to megabytes. Splitting them lets the search response stay tiny (a few KB for 20 places) and lets photo storage scale on its own clock — multi-TB without blowing up our hot index.
Once the QuadTree shards return a set of candidate place_ids, those candidates need to be ranked. Score = f(rating, distance, popularity, personalization, sponsored boost). The Ranking Service computes a score per candidate and returns the top-K. Each shard pre-ranks its own candidates and returns its local top-K (say, 100); the Aggregator then merges those K-per-shard into a final top-20.
Solves: turning "places that are nearby" into "places Sarah probably wants". Two restaurants 100m from Sarah, one with 4.8 stars and 5,000 reviews, one with 3.1 stars and 12 reviews — proximity alone would tie them. Ranking surfaces the right one first. Personalization (Sarah is vegetarian; she rates highly-vegan places higher) goes here too.
Sarah's exact GPS is (37.7879, -122.4075). She wants 20 ramen spots within 1600m, sorted by rating. Trace it through the numbered components:
1. The client sends GET /api/v1/search?lat=37.7879&lng=-122.4075&radius=1600&category=ramen&sort=rating.
2. The load balancer routes it to a Search API pod, which validates the input and checks the search-result cache under the key {ramen, San_Francisco_grid_cell, 1mi_bucket}. Cache hit on a hot query → return the result, done in 15ms. We'll assume a miss for the rest.
3. The Aggregator fans the query out to every QuadTree shard in parallel; each shard walks its in-memory tree and returns its local top-K candidates.
4. The Aggregator merges the candidates, the Ranking Service scores them, and the winning 20 place_ids are hydrated into full place objects from Memcached (Place DB on misses) and returned to Sarah.
5. Only when Sarah taps a specific place does the client fetch its full detail page via GET /api/v1/places/p_8a3e/reviews.

The QuadTree is the heart of this system. Time to crack it open and understand exactly how a tree of geographic rectangles becomes a sub-millisecond proximity index.
Every node in a QuadTree represents a rectangular region of the planet. The root represents the entire world. Each internal node has up to 4 children — its rectangle split into Northwest, Northeast, Southwest, Southeast quadrants. A leaf node holds an actual list of place_ids (and their lat/lng/category for fast filtering) — but only up to a threshold. We pick 500 places per leaf as the sweet spot: small enough to scan in microseconds, large enough that the tree doesn't get pathologically deep.
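A minimal sketch of that structure in Python — the names and the 500-place threshold mirror the description above; everything else (field layout, method names) is illustrative, not Yelp's actual code:

```python
from dataclasses import dataclass, field

LEAF_CAPACITY = 500  # max places per leaf before it splits into 4 quadrants

@dataclass
class Rect:
    """Axis-aligned lat/lng rectangle — the region of the planet a node covers."""
    min_lat: float
    min_lng: float
    max_lat: float
    max_lng: float

    def contains(self, lat: float, lng: float) -> bool:
        return self.min_lat <= lat < self.max_lat and self.min_lng <= lng < self.max_lng

    def quadrants(self) -> list["Rect"]:
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lng = (self.min_lng + self.max_lng) / 2
        return [Rect(self.min_lat, self.min_lng, mid_lat, mid_lng),   # SW
                Rect(self.min_lat, mid_lng, mid_lat, self.max_lng),   # SE
                Rect(mid_lat, self.min_lng, self.max_lat, mid_lng),   # NW
                Rect(mid_lat, mid_lng, self.max_lat, self.max_lng)]   # NE

@dataclass
class QuadNode:
    """One map tile: a leaf holding up to LEAF_CAPACITY places, or 4 children."""
    rect: Rect
    places: list = field(default_factory=list)    # (place_id, lat, lng) tuples
    children: list = field(default_factory=list)  # empty list ⇒ this node is a leaf

    def is_leaf(self) -> bool:
        return not self.children
```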
Given Sarah's (lat, lng) and radius, find the nearby places:
```
function searchNearby(node, lat, lng, radius, results):
    if node.rectangle does not intersect circle(lat, lng, radius):
        return                        // prune entire subtree
    if node.isLeaf():
        for place in node.places:     // at most 500
            if distance(place, lat, lng) <= radius:
                results.add(place)
        return
    for child in node.children:       // up to 4 quadrants
        searchNearby(child, lat, lng, radius, results)
```
The magic is the prune step. If a node's rectangle doesn't even touch Sarah's search circle, we skip the entire subtree — possibly millions of places eliminated in one pointer-comparison. This is why the QuadTree gives us O(log N) instead of O(N).
One subtle but important optimization: every leaf node maintains pointers to its geographically adjacent leaves (a doubly-linked list across the leaves). When Sarah's 1-mile radius spills out of her current leaf into a neighbor, we don't have to walk back up to the root and down again — we just follow the leaf pointer sideways. This turns "walk neighbors" from O(log N) per neighbor into O(1).
How big is this tree? Each leaf entry is (place_id 8B, lat 8B, lng 8B) = 24B; 500M × 24B ≈ 12 GB of leaf data. Internal nodes with 4 child pointers each are negligible (~10 MB). Total ~12 GB — it fits in the RAM of one fat server, but we'll partition for fault tolerance and to spread the 100K QPS load.

A new bakery opens in San Francisco. How does it enter the QuadTree?
Walk the QuadTree from the root: at each internal node, pick the child quadrant whose rectangle contains the new place's (lat, lng). Continue until you hit a leaf. Append the place to that leaf's list. If the leaf now exceeds 500 places, split it: create 4 new sub-quadrants, redistribute the 501 places among them by location, and turn the former leaf into an internal node pointing at the 4 new leaves.
The insert is O(log N) — at most 15 hops down a tree of ~1M leaves.
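The insert-and-split, continuing the QuadNode sketch above (hypothetical code, not Yelp's):

```python
def insert(node: QuadNode, place_id: int, lat: float, lng: float) -> None:
    # Walk down: at each internal node, follow the child quadrant containing the point.
    while not node.is_leaf():
        node = next(c for c in node.children if c.rect.contains(lat, lng))

    node.places.append((place_id, lat, lng))

    # Leaf overflow: split into 4 sub-quadrants and redistribute the 501 places.
    # (A production version would keep splitting if a sub-quadrant still overflows.)
    if len(node.places) > LEAF_CAPACITY:
        node.children = [QuadNode(rect=q) for q in node.rect.quadrants()]
        for pid, plat, plng in node.places:
            child = next(c for c in node.children if c.rect.contains(plat, plng))
            child.places.append((pid, plat, plng))
        node.places = []  # the former leaf is now an internal node
```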
Updating a place's name or category is just a Place DB write — the QuadTree doesn't care. Updating a place's location (rare) means: remove from old leaf, walk down again, insert into new leaf. Deleting is: remove from leaf; if the leaf falls under some lower threshold (say, 100 places) and its 3 sibling leaves are also sparse, merge them back into a single leaf to keep the tree balanced.
The QuadTree is ~12 GB and must serve 100K QPS — a single fat server could arguably hold the index and maybe even the load, but we'd have no fault tolerance and no headroom for growth. We must shard. Two natural choices:
"Shard 1 owns North America, shard 2 owns Europe, shard 3 owns Asia..." Intuitively appealing — a search in San Francisco only ever hits the North America shard.
Why it fails: the hot region problem. Manhattan has more places (and more searches) than the entire state of Wyoming. The Manhattan shard melts under load while the Wyoming shard sits idle. Adding more shards to North America just relocates the problem — geography is fundamentally non-uniform.
Compute shard = hash(place_id) mod N — or, better, place the hash on a consistent-hashing ring so adding a shard moves only ~1/N of the keys. Distribution is uniform by construction. Each shard has its own QuadTree containing roughly 1/N of all places — globally distributed, not regionally clustered.
Trade-off: a search at Sarah's location must fan out to all shards in parallel, because the places near her are spread uniformly across every shard. The aggregator (④) merges results. This is the cost of uniformity, and it's why the aggregator exists.
Critical detail: each shard owns its own complete QuadTree covering the entire planet — but populated with only its 1/N slice of places. Shard 7's QuadTree has the same root-to-leaf structure as shard 12's, just with different place_ids in the leaves. A geo-search hits all N shards in parallel, each returns its local top-K candidates, the aggregator picks the global top-K from the N×K candidate pool.
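A sketch of that scatter-gather in Python — query_shard is a hypothetical per-shard RPC returning (score, place_id) pairs; the thread-pool fan-out and heap merge are one reasonable way to realize the description above:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(shards, lat, lng, radius_m, k=20, per_shard_k=100):
    """Fan out to every QuadTree shard in parallel, merge the N×K candidates."""
    def query(shard):
        # Hypothetical RPC: returns [(score, place_id), ...] — the shard's local top-K.
        return shard.query_shard(lat, lng, radius_m, limit=per_shard_k)

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard_results = list(pool.map(query, shards))

    # Global top-K out of the N × per_shard_k candidate pool.
    candidates = [c for results in per_shard_results for c in results]
    return heapq.nlargest(k, candidates)  # ranks by score (first tuple element)
```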
QuadTrees live in RAM. RAM is volatile. A QuadTree server crash means we just lost an in-memory index — we need to recover it fast.
Every QuadTree shard runs on at least 2 servers (a primary and a hot replica). They both serve reads — doubling per-shard read capacity. Both are kept in sync by listening to the same Place DB change stream (Cassandra CDC, or a Kafka topic of place mutations). When a place is added/updated/deleted, both replicas of the responsible shard apply the change.
If the primary dies, the replica takes over. Latency blip of a few seconds for failover, no data loss because the index is rebuildable from Place DB.
A full failure of all replicas of a shard means we need to rebuild that shard from scratch. This is where the QuadTree Index Server ⑦ earns its keep. Instead of asking Place DB "show me every one of 500M places, let me check if it belongs to shard 7" (a full table scan), we ask the index server "give me every placeId whose home shard is 7" — it returns the ~16M relevant IDs, we stream just those rows from Place DB, walk the QuadTree insert algorithm 16M times, and the shard is back online in minutes, not hours.
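Recovery, sketched against hypothetical index-server and Place DB clients (the method names are assumptions), reusing the insert routine from the QuadTree deep-dive:

```python
def rebuild_shard(shard_id, index_server, place_db, world_rect) -> QuadNode:
    """Rebuild one QuadTree shard in memory after a total replica loss."""
    root = QuadNode(rect=world_rect)

    # Targeted reload: only the ~16M place_ids homed on this shard,
    # not a scan over all 500M rows.
    place_ids = index_server.place_ids_for_shard(shard_id)   # hypothetical API
    for batch in place_db.stream_places(place_ids):          # hypothetical API
        for place in batch:
            insert(root, place["id"], place["lat"], place["lng"])
    return root
```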
Caching in Yelp comes in two flavors, both important.
The QuadTree returns 20 place_ids; the API needs the full place objects (name, description, photo URL, current avg rating). Looking up 20 objects per search × 100K searches/sec = 2M lookups/sec on Place DB — too much. Memcached caches the hot ~5 million place objects (think: every popular tourist destination in every major city) with LRU eviction. ~95%+ hit rate; Place DB sees only the cold tail.
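A read-through sketch for that hydration step — the cache and Place DB clients and their method names are hypothetical stand-ins:

```python
def hydrate_places(place_ids, cache, place_db):
    """Batch-fetch full place objects: Memcached first, Place DB for the misses."""
    cached = cache.get_many([f"place:{pid}" for pid in place_ids])   # hypothetical client
    places, missing = {}, []
    for pid in place_ids:
        obj = cached.get(f"place:{pid}")
        if obj is not None:
            places[pid] = obj
        else:
            missing.append(pid)

    if missing:
        fresh = place_db.multi_get(missing)                          # hypothetical client
        for pid, obj in fresh.items():
            places[pid] = obj
            cache.set(f"place:{pid}", obj, expire=3600)              # hot set, LRU-evicted
    return [places[pid] for pid in place_ids if pid in places]
```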
Many searches are repetitive: "ramen + downtown SF + 1 mile" runs thousands of times per minute from different users at roughly the same location. We bucket the user's location into a coarse grid cell (e.g., 200m squares) and cache (grid_cell, category, radius_bucket) → top-20 place_ids for 60 seconds. A massive amount of read traffic short-circuits before ever touching the QuadTree fleet.
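And the bucketing that makes nearby users collide on the same cache key — the 200m cell and 60s TTL follow the text above; the key shape and the 400m radius buckets are illustrative:

```python
CELL_M = 200             # coarse grid cell ≈ 200m on a side
M_PER_DEG_LAT = 111_320  # metres per degree of latitude (approx.)

def result_cache_key(lat, lng, category, radius_m):
    """Nearby users asking the same question hit the same key."""
    cell_deg = CELL_M / M_PER_DEG_LAT
    grid_cell = (int(lat // cell_deg), int(lng // cell_deg))
    radius_bucket = round(radius_m / 400) * 400   # snap radius to coarse buckets
    return f"search:{grid_cell[0]}:{grid_cell[1]}:{category}:{radius_bucket}"

# Cached for 60 seconds: key → top-20 place_ids
```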
Two LB layers in this system, each playing a different role.
Public-facing LB (AWS ALB / nginx). Distributes incoming HTTPS across Search API pods. Health-checks every 5 seconds, terminates TLS, evicts unhealthy pods. Round-robin works fine because every API pod is stateless and identical.
Internal fan-out LB. Per-shard, the aggregator must pick one of the 2+ replicas to query. Round-robin at first; load-aware (least connections, or per-replica latency tracking) once you see one replica running consistently hotter than its peer due to noisy-neighbor effects on its host.
The QuadTree finds nearby places. Ranking decides which nearby places to show first. A good ranking turns "we found 200 ramen places in your radius" into "here are the 20 you should actually look at".
A simple linear blend works surprisingly well as a baseline:
```
score(place, user) =
    w1 * normalized_rating     // 4.6 stars → 0.92
  + w2 * proximity_score       // inverse distance, capped
  + w3 * popularity            // log(review_count)
  + w4 * personalization       // user's preferences, history
  + w5 * sponsored_boost       // paid placement, capped
```
Each weight wi is tuned via offline A/B and online experiments. Rating + popularity dominate the head of the distribution; personalization decides ties at the head and pushes specific places up the long tail.
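One possible baseline implementation of the blend — the normalizations follow the comments above (rating/5, log review count, capped inverse distance), while the weights and the affinity_for personalization hook are placeholder assumptions:

```python
import math

WEIGHTS = dict(rating=0.4, proximity=0.25, popularity=0.2, personal=0.1, sponsored=0.05)

def score(place, user, radius_m):
    normalized_rating = place["rating"] / 5.0                        # 4.6 stars → 0.92
    proximity = max(0.0, 1.0 - place["distance_m"] / radius_m)       # inverse distance, capped
    popularity = math.log1p(place["review_count"]) / math.log1p(10_000)  # log-scaled review count
    personal = user.affinity_for(place)                              # hypothetical: 0..1 from history
    sponsored = min(place.get("boost", 0.0), 0.2)                    # paid placement, capped

    return (WEIGHTS["rating"] * normalized_rating
            + WEIGHTS["proximity"] * proximity
            + WEIGHTS["popularity"] * popularity
            + WEIGHTS["personal"] * personal
            + WEIGHTS["sponsored"] * sponsored)
```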
Each QuadTree shard pre-ranks its candidates locally — it may find thousands of places in the radius, but scores them and returns only its top-100 to the Aggregator. The Aggregator does final ranking across the N×100 candidates from N shards and picks the global top-20. This two-stage approach keeps shard-to-aggregator network traffic small (we send 100 ranked candidates per shard, not 10,000 unranked ones) without sacrificing global ranking quality.
One contrast worth naming: a live-location workload such as ride-hailing driver positions is write-heavy — every driver reports a new position every few seconds — and an in-memory QuadTree would churn constantly on re-inserts. A better fit there is a plain key-value mapping of geohash_prefix → set-of-driver-ids, with TTL so stale locations auto-expire. Lookups become "fetch the driver-id sets for my geohash + neighbors". This loses density-adaptation but gains write-friendliness — exactly the right trade for that workload.

Finally, how do new reviews reach the ranking? An asynchronous aggregation job recomputes (avg_rating, review_count) per place and updates the denormalized fields on the Place row in Place DB. The QuadTree itself doesn't store rating — it stores place_id; the Aggregator looks up the current rating from cache/Place DB at query time. So new reviews are reflected within seconds on cache misses, and within 60s for cache hits (TTL). Heavier-weight personalization features (user history embeddings) can be recomputed nightly without breaking the user experience.