50 files. 40 GB of data. 16 GB of RAM. No, you can't sort it. Here's how bucket counting cracks it in two clean passes — explained from scratch.
"We need the median of all of it. Before Monday." Maya looks at the disk: 50 files, 10 billion 32-bit integers in total — about 40 gigabytes of raw data. Her laptop has 16 GB of RAM. The math is unkind.
The first instinct of every engineer who has ever passed a coding interview is the same: just sort it and pick the middle one. That instinct is correct on a whiteboard with 7 numbers. It is catastrophically wrong with 10 billion. The whole point of this article is to show you why it's wrong, and what the right shape of the answer looks like — without any black magic, just one simple idea applied twice.
Before we touch a line of code, let's get the units straight. They will matter for every decision we make later.
| Quantity | Value |
|---|---|
| Total integers (N) | 10,000,000,000 — ten billion |
| Size per integer | 4 bytes (32-bit signed) |
| Data on disk | 40 GB |
| Available RAM | 16 GB — less than half the data |
| Number of files | 50 (~800 MB each, sequential read-friendly) |
| Target rank | ⌈N/2⌉ (the lower median — keeps the math simple) |

Maya's first attempt looks like every textbook answer: load the data, sort it, return arr[N/2]. Watch what happens at each step.
Right. So we don't load it all. The next-best textbook answer is external merge sort — break the input into RAM-sized chunks, sort each chunk, write each sorted chunk back to disk, then merge them all into one giant sorted file. This is a real algorithm; databases use it. Let's count what it costs us.
Read 40 GB once. Sort each ~10 GB chunk in RAM. Write 40 GB back. I/O so far: 80 GB.
Open all sorted runs at once and merge them with a min-heap, writing one giant sorted file. Read 40 GB, write 40 GB. Another 80 GB of I/O.
Open the sorted file, skip 5 billion entries, read the median. Up to 20 GB more I/O.
~180 GB of disk I/O and a 40 GB sorted file you'll never use again. On a 500 MB/s SSD, that's 6+ minutes of pure I/O — and we threw away most of the work.
This is the trap: external merge sort produces an artifact (a fully sorted file) that is far more powerful than the question asked for. We don't need every element in order. We need one element — the one at rank N/2. Paying to sort all 9,999,999,999 of the others is wasted work.
Here is the single insight that unlocks everything: the median is defined purely by its position in the sorted list, not by the sort itself. You only need to find the value v such that exactly N/2 values in the dataset are ≤ v. You don't need to know what's to the left of it. You don't need to know what's to the right of it. You just need to know where the boundary sits.
Here's an analogy. Imagine you ask 10 billion people their height and want to find the median height. You don't need to line them all up in a row. You just ask each person to step into the room labeled with their height bracket — under 150 cm, 150–160, 160–170, and so on. Then you walk the rooms in order and count: "room 1 has 1.8 billion, room 2 has 2.0 billion, room 3 has 2.5 billion — that's 6.3 billion, we've crossed the halfway mark of 5 billion in room 3, the median is somewhere in this room."
You never sorted anyone. You just counted who landed where. That's the entire trick.
A 32-bit signed integer has a fixed, finite range: from −2³¹ (about −2.15 billion) to 2³¹ − 1 (about +2.15 billion). That's roughly 4.3 billion possible distinct values. Bounded range is the magic word — it means we can pre-define a fixed grid of buckets that covers every possible value, no surprises.
Here is the picture. Carve up the entire integer range into 2¹⁶ = 65,536 equal-sized buckets. Each bucket is just a counter — an entry in an array. When we read a number v from disk, we figure out which bucket it belongs to, and increment that counter. That's it. We never store v itself.
A 32-bit integer is just two halves stitched together: 16 high bits and 16 low bits. The high bits tell you the coarse position of the number (which broad bracket it's in); the low bits tell you the fine position within that bracket. If you split the value range into 65,536 equally-sized buckets, then which bucket a number falls into is determined entirely by its high 16 bits.
The bucket index costs just a shift and a mask:
```java
// unsigned thinking for now; we'll handle the signed offset in a moment
int idx = (v >>> 16) & 0xFFFF; // high 16 bits, value in [0, 65535]
buckets[idx]++;
```
65,536 buckets × 8 bytes per long counter = 512 KB. That fits in your CPU's L2 cache, never mind your RAM. The 16 GB constraint is a non-issue.
One bit shift, one mask, one array increment. We can process numbers as fast as the disk can hand them to us. The job is now I/O-bound, not CPU-bound.
Pass 1 is a single sequential read of all 50 files. For every integer v we see, we increment buckets[high16(v)]. At the end of the pass, the array tells us how the 10 billion values are distributed across 65,536 brackets.
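In code, Pass 1 is nothing more than a loop. Here's a minimal sketch — the `data` array is my stand-in for the ints streamed off disk, and values are assumed non-negative so the unsigned view matches (the signed offset is dealt with shortly):

```java
class Pass1Sketch {
    // Count how many values land in each of the 65,536 high-16-bit buckets.
    // `data` stands in for the ints streamed off the 50 files; values are
    // assumed non-negative here so the unsigned view matches the signed one.
    static long[] countHigh16(int[] data) {
        long[] buckets = new long[65536];
        for (int v : data) {
            buckets[v >>> 16]++; // high 16 bits pick the bucket
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] b = countHigh16(new int[]{0, 1, 65535, 65536, 1 << 20});
        System.out.println(b[0]); // prints 3 — values 0, 1, 65535 share bucket 0
    }
}
```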
Now we go hunting for the median. We walk the buckets in order, keeping a running sum. The moment that sum first reaches N/2, the bucket we just stepped into is the one that contains the median.
```java
long target = N / 2;
long cumulative = 0;
int medianBucket = -1;
long countBefore = 0;

for (int i = 0; i < 65536; i++) {
    if (cumulative + buckets[i] >= target) {
        medianBucket = i;
        countBefore = cumulative; // values strictly before this bucket
        break;
    }
    cumulative += buckets[i];
}
```
Two things come out of this loop, and we'll use both in Pass 2:
- medianBucket — the index of the bucket containing the median value. We now know the median's coarse location: the high 16 bits of its value are pinned down.
- countBefore — how many of the 10 billion numbers landed in earlier buckets. The median is the (target − countBefore)-th smallest within medianBucket.
A small detail that trips people up. With signed 32-bit ints, the bit pattern 0x80000000 is the most-negative number, and 0x7FFFFFFF is the most-positive. If you bucket by raw high-16-bits, the negatives land at the top of the array, not the bottom — which would silently break the "walk in order" loop above.
The fix is a one-liner at read time: add 2³¹ to every value (or equivalently, XOR the sign bit). That maps the signed range [−2³¹, 2³¹−1] to the unsigned range [0, 2³²−1], preserving order. Now bucket index 0 holds the most negative numbers, bucket 65,535 holds the most positive, and the cumulative walk works correctly. We undo the offset only at the very end when we report the answer.
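The mapping is worth seeing in code. A tiny sketch (helper names `toUnsignedKey` and `bucketOf` are mine, not a library's):

```java
class SignedOrder {
    // XOR the sign bit: maps signed int order onto unsigned int order.
    // Equivalent to adding 2^31 in unsigned arithmetic.
    static long toUnsignedKey(int v) {
        return (v ^ 0x80000000) & 0xFFFFFFFFL; // widen so the key compares as unsigned
    }

    static int bucketOf(int v) {
        return (int) (toUnsignedKey(v) >>> 16); // most negative value -> bucket 0
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(Integer.MIN_VALUE)); // prints 0
        System.out.println(bucketOf(-1));                // prints 32767
        System.out.println(bucketOf(0));                 // prints 32768
        System.out.println(bucketOf(Integer.MAX_VALUE)); // prints 65535
    }
}
```

Note how the negatives now fill buckets 0 through 32767 and the non-negatives fill 32768 through 65535 — exactly the order the cumulative walk expects.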
Pass 1 narrowed the median to one of 65,536 brackets. Each bracket is 2¹⁶ = 65,536 values wide. We now do the same trick again — but this time we only care about values that land in medianBucket, and we sub-bucket them by their low 16 bits.
It's the same algorithm, scoped down. We allocate another 65,536-entry counter array (sub[]). We re-stream all 50 files. For each integer, we filter — if its high 16 bits don't match medianBucket, we ignore it. Otherwise, we extract its low 16 bits and increment sub[low16(v)].
```java
long[] sub = new long[65536];
for (int v : streamAllFiles()) {
    int hi = (v >>> 16) & 0xFFFF;
    if (hi == medianBucket) {
        int lo = v & 0xFFFF;
        sub[lo]++;
    }
}
```
After Pass 2, sub[] tells us how many copies of every individual value in medianBucket exist in the dataset. We do one last cumulative walk:
```java
long needRank = target - countBefore; // rank within medianBucket
long cum = 0;
int medianLow = -1;

for (int i = 0; i < 65536; i++) {
    cum += sub[i];
    if (cum >= needRank) {
        medianLow = i;
        break;
    }
}

int median = (medianBucket << 16) | medianLow; // stitch back together
```
Two 16-bit halves combine to give the exact 32-bit median value. Done — in two sequential reads of the data, with under 1 MB of RAM total.
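For the skeptical, here is the whole algorithm stitched into one self-contained method — a sketch, with an in-memory `int[]` standing in for the two streaming reads, and the sign-bit XOR applied at read time and undone at the end:

```java
class TwoPassMedian {
    // Lower median (rank ceil(N/2)) of signed 32-bit ints, two counting passes.
    // `data` is an in-memory stand-in for two sequential reads of the files.
    static int median(int[] data) {
        long target = (data.length + 1) / 2; // 1-based rank of the lower median

        // Pass 1: bucket by high 16 bits of the order-preserving unsigned key.
        long[] buckets = new long[65536];
        for (int v : data) buckets[(v ^ 0x80000000) >>> 16]++;

        long cumulative = 0, countBefore = 0;
        int medianBucket = 0;
        for (int i = 0; i < 65536; i++) {
            if (cumulative + buckets[i] >= target) {
                medianBucket = i;
                countBefore = cumulative;
                break;
            }
            cumulative += buckets[i];
        }

        // Pass 2: re-stream, keep only the median bucket, count low 16 bits.
        long[] sub = new long[65536];
        for (int v : data) {
            int key = v ^ 0x80000000;
            if ((key >>> 16) == medianBucket) sub[key & 0xFFFF]++;
        }

        long need = target - countBefore, cum = 0;
        int medianLow = 0;
        for (int i = 0; i < 65536; i++) {
            cum += sub[i];
            if (cum >= need) { medianLow = i; break; }
        }
        // Stitch the halves, then undo the sign-bit offset.
        return ((medianBucket << 16) | medianLow) ^ 0x80000000;
    }

    public static void main(String[] args) {
        System.out.println(median(new int[]{3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5})); // prints 4
    }
}
```

The real program differs only in where the ints come from — 50 file streams instead of one array — and in splitting the Pass 1 loop across threads.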
Algorithms feel abstract until you watch them eat real numbers. Let's shrink the problem so the whole thing fits on a screen.
Pretend our integers are 4 bits wide (range [0, 15]) and we have N = 16 of them across two files. We'll use 4 buckets in each pass instead of 65,536, but the math is identical.
- File A: 3, 1, 4, 1, 5, 9, 2, 6
- File B: 5, 3, 5, 8, 9, 7, 9, 3
- N = 16 → target rank = N/2 = 8 (the 8th smallest)
- Sorted, for checking only: 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 9, 9 → answer should be 5

With 4 buckets, each covers 4 values. The "bucket index" of a number is just v >> 2 (the high 2 bits of a 4-bit number).
| Bucket | Range | Values that land here | Count | Cumulative |
|---|---|---|---|---|
| 0 | 0–3 | 1, 1, 2, 3, 3, 3 | 6 | 6 |
| 1 | 4–7 | 4, 5, 5, 5, 6, 7 | 6 | 12 ← crosses 8 here |
| 2 | 8–11 | 8, 9, 9, 9 | 4 | 16 |
| 3 | 12–15 | — | 0 | 16 |
Cumulative count first reaches the target (8) at bucket 1. So:
- medianBucket = 1
- countBefore = 6 (elements in earlier buckets)
- target − countBefore = 8 − 6 = 2 → the median is the 2nd smallest within bucket 1

Re-stream both files. Filter to values in bucket 1 (range 4–7). Increment sub[v & 0b11] for each.
| Sub-bucket (low 2 bits) | Actual value | Count | Cumulative |
|---|---|---|---|
| 0 | 4 (bucket_base + 0) | 1 | 1 |
| 1 | 5 | 3 | 4 ← crosses 2 here |
| 2 | 6 | 1 | 5 |
| 3 | 7 | 1 | 6 |
Cumulative first reaches needRank = 2 at sub-bucket 1. The actual value is (medianBucket << 2) | 1 = (1 << 2) | 1 = 5. ✓
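The walkthrough is small enough to verify mechanically. Here's a sketch mirroring it, with 2-bit high/low splits instead of 16-bit ones (the helper name `median4bit` is mine):

```java
class ToyExample {
    // 4-bit values, 4 buckets of width 4: bucket = v >> 2, sub-bucket = v & 0b11.
    static int median4bit(int[] data, long target) {
        long[] buckets = new long[4];
        for (int v : data) buckets[v >> 2]++;

        long cumulative = 0, countBefore = 0;
        int medianBucket = 0;
        for (int i = 0; i < 4; i++) {
            if (cumulative + buckets[i] >= target) {
                medianBucket = i;
                countBefore = cumulative;
                break;
            }
            cumulative += buckets[i];
        }

        long[] sub = new long[4];
        for (int v : data) if (v >> 2 == medianBucket) sub[v & 0b11]++;

        long need = target - countBefore, cum = 0;
        int low = 0;
        for (int i = 0; i < 4; i++) {
            cum += sub[i];
            if (cum >= need) { low = i; break; }
        }
        return (medianBucket << 2) | low; // stitch high and low bits
    }

    public static void main(String[] args) {
        int[] all = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3}; // both toy files
        System.out.println(median4bit(all, 8)); // prints 5, matching the walkthrough
    }
}
```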
Now that the shape of the algorithm is clear, let's price it out for the real 40 GB problem.
| Resource | External merge sort | Bucket counting (this article) |
|---|---|---|
| Peak RAM | Tunable, typically 1–4 GB for buffers | ~1 MB (two 65k-entry long arrays) |
| Disk reads | ~100 GB (chunk-read + merge-read + median walk) | ~80 GB (two streaming passes) |
| Disk writes | ~80 GB (sorted runs + final sorted file) | 0 — no intermediate output |
| CPU per element | O(log N) inside heap-merge | O(1) — shift, mask, array increment |
| Time on a 500 MB/s SSD | ~7 minutes (I/O-dominated) | ~3 minutes |
| Parallel-friendly? | Sort phase yes; merge phase serial | Yes — both passes embarrassingly parallel |
At 40 GB, every second of wall-clock time is dictated by how many bytes you move. A modern NVMe SSD reads at 3–5 GB/s sequential; a SATA SSD at 500 MB/s; a network-attached blob store at maybe 200 MB/s. The bucket method's win isn't that it's smarter — it's that it does less I/O and the I/O it does is purely sequential, which is the fastest mode on every storage medium.
Three rules of thumb turn this from "an algorithm" into "a program that finishes":

- Read sequentially, in big chunks. mmap lets the kernel page in file pages on demand and skip a userspace copy; for sequential scans of bounded-size files it's often the single biggest win.
- Keep the per-element work minimal: one shift, one mask, one array increment — nothing that allocates.
- Parallelize across files, one private counter array per thread, summed element-wise at the end.

The two-pass story is the headline. The rest of this section is the small print that separates a good answer from a great one.
If N is even, the median is conventionally defined as the average of the two middle values (rank N/2 and N/2+1). Track both ranks during the cumulative walk. Often they fall in the same final sub-bucket — answer is that value. If they fall in different sub-buckets, it's the average of the two values.
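One way to sketch this: make the cumulative walk generic over a rank, and the even case becomes two calls. The histogram-based helper below is illustrative (names are mine), not lifted from the article's pipeline:

```java
class EvenMedian {
    // Given a histogram where hist[v] counts occurrences of value v,
    // return the value at 1-based rank r via a cumulative walk.
    static int valueAtRank(long[] hist, long r) {
        long cum = 0;
        for (int v = 0; v < hist.length; v++) {
            cum += hist[v];
            if (cum >= r) return v;
        }
        throw new IllegalArgumentException("rank beyond total count");
    }

    // Even N: the conventional median averages the ranks N/2 and N/2 + 1.
    static double evenMedian(long[] hist, long n) {
        return (valueAtRank(hist, n / 2) + valueAtRank(hist, n / 2 + 1)) / 2.0;
    }

    public static void main(String[] args) {
        long[] hist = new long[16];
        for (int v : new int[]{1, 2, 4, 7}) hist[v]++;
        System.out.println(evenMedian(hist, 4)); // prints 3.0 — average of 2 and 4
    }
}
```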
You can't bucket the full 64-bit range in one shot — far too many buckets. Solution: do four passes of 16 bits each (high → mid-high → mid-low → low), or three passes of 22-bit slices (4M counters ≈ 32 MB — still tiny). Same algorithm, just extra zoom levels; with 16-bit slices, RAM stays at 512 KB per pass.
What if Pass 1 dumps 6 billion of the 10 billion values into a single bucket (e.g., almost all zeros)? Pass 2 still works — the sub-bucket array doesn't grow with input size, only with bucket width. And if a sub-bucket at some level is still huge, you simply recurse: 16 bits → 8 bits → 8 bits. The algorithm is self-correcting — you never run out of zoom levels until you're down to naming a single value.
Mentioned in §5: signed integers don't bucket in natural order without a fix. Add 2³¹ at read time to map [−2³¹, 2³¹−1] → [0, 2³²−1], preserving order. Subtract it back at the end. One arithmetic op per element — basically free.
Duplicates? Trivially handled. The bucket counter just increments more often. Counting doesn't care about distinctness.
Pass 1 has zero cross-thread state. Each thread keeps its own 512 KB bucket array; at the end of the pass, sum them element-wise. With 50 files and 16 cores, you're disk-bound long before you're CPU-bound — the algorithm scales with how fast you can pull bytes off storage.
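A minimal sketch of that scheme, with `int[]` chunks standing in for the files and simplified thread handling (`countAll` is a name of my choosing):

```java
import java.util.List;

class ParallelPass1 {
    // Each "file" (here: an int[] chunk) gets its own thread and its own
    // private bucket array; the per-thread arrays are summed at the end.
    static long[] countAll(List<int[]> files) {
        long[][] perThread = new long[files.size()][65536];
        Thread[] threads = new Thread[files.size()];
        for (int t = 0; t < files.size(); t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int v : files.get(id)) perThread[id][v >>> 16]++; // no shared state
            });
            threads[t].start();
        }
        long[] total = new long[65536];
        for (int t = 0; t < threads.length; t++) {
            try {
                threads[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            for (int i = 0; i < 65536; i++) total[i] += perThread[t][i]; // element-wise sum
        }
        return total;
    }

    public static void main(String[] args) {
        long[] b = countAll(List.of(new int[]{1, 2, 70000}, new int[]{3, 65536}));
        System.out.println(b[0] + " " + b[1]); // prints "3 2"
    }
}
```

The `join` before the sum is what makes the per-thread counts safely visible to the summing thread.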
Replace target = N/2 with target = N * p for the p-th percentile. Same two passes. Useful when you need P95, P99, etc. on huge logs.

The bucket method has one hard prerequisite: the values must come from a bounded, discretizable range that you can carve into buckets ahead of time. When that breaks, you need a different tool. Three real-world cases:
Floats can be reinterpreted as integers (the IEEE 754 bit pattern is monotonic for non-negative floats), so you can apply the same trick — but the math is fiddly: handle NaN, ±0, denormals, and flip bits for negatives. Doable, but most teams give up exact answers and use approximation.
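For the curious, the standard bit trick looks like this — a sketch that deliberately ignores NaN, which a production version must filter out first:

```java
class FloatKey {
    // Map a float's IEEE 754 bit pattern to an int whose *unsigned* order
    // matches float order: flip the sign bit for non-negative patterns,
    // flip all bits for negative ones. NaN is left out of this sketch.
    static int sortableKey(float f) {
        int bits = Float.floatToRawIntBits(f);
        return bits >= 0 ? bits ^ 0x80000000 : ~bits;
    }

    public static void main(String[] args) {
        float[] ordered = {Float.NEGATIVE_INFINITY, -1.5f, -0.0f, 0.0f, 2.5f, Float.POSITIVE_INFINITY};
        for (int i = 1; i < ordered.length; i++) {
            // unsigned comparison of keys must agree with float order
            System.out.println(Integer.compareUnsigned(
                sortableKey(ordered[i - 1]), sortableKey(ordered[i])) <= 0); // prints true each time
        }
    }
}
```

With that mapping in place, the two-pass bucket method runs on the keys unchanged.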
If values can be arbitrarily large, you can't pre-allocate a bucket grid. Fall back to quickselect-on-disk with sampled pivots, or external merge sort, or sketches.
Sometimes you can't make two passes — the data is a live stream. You give up exact median for an approximate one. t-digest, P², and HDR Histogram are the standard tools. Each gives you any percentile, with bounded error, in fixed memory, in one pass.
If your business doesn't actually need the exact median to the integer — just "around 5,000-ish, ±1%" — random reservoir sampling is the simplest correct answer. Sample 10 million numbers uniformly, sort the sample in RAM, take the middle. Ten lines of code, near-perfect accuracy for typical distributions.
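A sketch of that ten-ish-line answer, using plain Algorithm R reservoir sampling (`approxMedian` and its parameters are my naming, not from the article):

```java
import java.util.Arrays;
import java.util.Random;

class ReservoirMedian {
    // Algorithm R: keep a uniform random sample of k elements from a stream
    // of unknown length, then take the middle of the sorted sample.
    static int approxMedian(int[] stream, int k, long seed) {
        Random rng = new Random(seed);
        int[] sample = new int[Math.min(k, stream.length)];
        for (int i = 0; i < stream.length; i++) {
            if (i < sample.length) {
                sample[i] = stream[i];          // fill the reservoir first
            } else {
                int j = rng.nextInt(i + 1);     // keep element i with probability k/(i+1)
                if (j < sample.length) sample[j] = stream[i];
            }
        }
        Arrays.sort(sample);
        return sample[sample.length / 2];
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i; // true median ~ 50_000
        System.out.println(approxMedian(data, 10_001, 42)); // close to 50_000, not exact
    }
}
```

When k is at least the stream length, the "sample" is the whole dataset and the answer is exact — a convenient property for testing.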
If someone stops you in the hallway and asks you to explain the whole article in a minute, here's the script.
The problem. 10 billion 32-bit integers across 50 files, 40 GB of data, 16 GB RAM. Find the median. Sorting is wasteful — we don't need the order, just the value at rank N/2.
The trick. A 32-bit integer's range is bounded. Carve it into 65,536 brackets (one per high-16-bits value) and just count how many numbers land in each bracket. Counting is O(N) and needs ~512 KB of RAM — trivial.
Pass 1. Stream all 50 files once. For each int, increment buckets[high16(v)]. Walk the bucket array cumulatively — the bucket where the running sum first crosses N/2 contains the median.
Pass 2. Re-stream the files. Only count values whose high 16 bits match the median bucket. Sub-bucket them by their low 16 bits. Walk the sub-array cumulatively. The sub-bucket where the count crosses (N/2 − countBefore) is the exact median value. Stitch the high and low halves back together — done.
The cost. Two streaming reads of the data (~80 GB I/O), under 1 MB of RAM, no intermediate writes, embarrassingly parallel across files.
The general idea. When you only need a positional answer (median, percentile, k-th smallest) over a large dataset with a bounded value range, replace sorting with counting. Counting fits in cache. Sorting fits in pain.