50 files. 40 GB of data. 16 GB of RAM. No, you can't sort it. Here's how bucket counting cracks it in two clean passes — explained from scratch.
"We need the median of all of it. Before Monday." Maya looks at the disk: 50 files, 10 billion 32-bit integers in total — about 40 gigabytes of raw data. Her laptop has 16 GB of RAM. The math is unkind.
The first instinct of every engineer who has ever passed a coding interview is the same: just sort it and pick the middle one. That instinct is correct on a whiteboard with 7 numbers. It is catastrophically wrong with 10 billion. The whole point of this article is to show you why it's wrong, and what the right shape of the answer looks like — without any black magic, just one simple idea applied twice.
Before we touch a line of code, let's get the units straight. They will matter for every decision we make later.
| Quantity | Value |
|---|---|
| Total integers (N) | 10,000,000,000 — ten billion |
| Size per integer | 4 bytes (32-bit signed) |
| Data on disk | 40 GB |
| Available RAM | 16 GB — less than half the data |
| Number of files | 50 (~800 MB each, sequential read-friendly) |
| Target rank | ⌈N/2⌉ (the lower median — keeps the math simple) |

Maya's first attempt looks like every textbook answer: load the data, sort it, return arr[N/2]. Watch what happens at each step.
Right. So we don't load it all. The next-best textbook answer is external merge sort — break the input into RAM-sized chunks, sort each chunk, write each sorted chunk back to disk, then merge them all into one giant sorted file. This is a real algorithm; databases use it. Let's count what it costs us.
Read 40 GB once. Sort each ~10 GB chunk in RAM. Write 40 GB back. I/O so far: 80 GB.
Open all sorted runs at once and merge them with a min-heap, writing one giant sorted file. Read 40 GB, write 40 GB. Another 80 GB of I/O.
Open the sorted file, skip 5 billion entries, read the median. Up to 20 GB more I/O.
~180 GB of disk I/O and a 40 GB sorted file you'll never use again. On a 500 MB/s SSD, that's 6+ minutes of pure I/O — and we threw away most of the work.
This is the trap: external merge sort produces an artifact (a fully sorted file) that is far more powerful than the question asked for. We don't need every element in order. We need one element — the one at rank N/2. Paying to sort all 9,999,999,999 of the others is wasted work.
Here is the single insight that unlocks everything: the median is defined purely by its position in the sorted list, not by the sort itself. You only need to find the value v such that exactly N/2 values in the dataset are ≤ v. You don't need to know what's to the left of it. You don't need to know what's to the right of it. You just need to know where the boundary sits.
Here's an analogy. Imagine you ask 10 billion people their height and want to find the median height. You don't need to line them all up in a row. You just ask each person to step into the room labeled with their height bracket — under 150 cm, 150–160, 160–170, and so on. Then you walk the rooms in order and count: "room 1 has 1.8 billion, room 2 has 2.0 billion, room 3 has 2.5 billion — that's 6.3 billion, we've crossed the halfway mark of 5 billion in room 3, the median is somewhere in this room."
You never sorted anyone. You just counted who landed where. That's the entire trick.
A 32-bit signed integer has a fixed, finite range: from −2³¹ (about −2.15 billion) to 2³¹ − 1 (about +2.15 billion). That's roughly 4.3 billion possible distinct values. Bounded range is the magic word — it means we can pre-define a fixed grid of buckets that covers every possible value, no surprises.
Here is the picture. Carve up the entire integer range into 2¹⁶ = 65,536 equal-sized buckets. Each bucket is just a counter — an entry in an array. When we read a number v from disk, we figure out which bucket it belongs to, and increment that counter. That's it. We never store v itself.
A 32-bit integer is just two halves stitched together: 16 high bits and 16 low bits. The high bits tell you the coarse position of the number (which broad bracket it's in); the low bits tell you the fine position within that bracket. If you split the value range into 65,536 equally-sized buckets, then which bucket a number falls into is determined entirely by its high 16 bits.
The bucket index costs just a shift and a mask:
```java
// unsigned thinking for now; we'll handle the signed offset in a moment
int idx = (v >>> 16) & 0xFFFF; // high 16 bits, value in [0, 65535]
buckets[idx]++;
```
65,536 buckets × 8 bytes per long counter = 512 KB. That fits in your CPU's L2 cache, never mind your RAM. The 16 GB constraint is a non-issue.
One bit shift, one mask, one array increment. We can process numbers as fast as the disk can hand them to us. The job is now I/O-bound, not CPU-bound.
Pass 1 is a single sequential read of all 50 files. For every integer v we see, we increment buckets[high16(v)]. At the end of the pass, the array tells us how the 10 billion values are distributed across 65,536 brackets.
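In code, Pass 1 is nothing more than a loop. Here's a minimal sketch — the `data` array is my stand-in for the ints streamed off disk, and values are assumed non-negative so the unsigned view matches (the signed offset is dealt with shortly):

```java
class Pass1Sketch {
    // Count how many values land in each of the 65,536 high-16-bit buckets.
    // `data` stands in for the ints streamed off the 50 files; values are
    // assumed non-negative here so the unsigned view matches the signed one.
    static long[] countHigh16(int[] data) {
        long[] buckets = new long[65536];
        for (int v : data) {
            buckets[v >>> 16]++; // high 16 bits pick the bucket
        }
        return buckets;
    }

    public static void main(String[] args) {
        long[] b = countHigh16(new int[]{0, 1, 65535, 65536, 1 << 20});
        System.out.println(b[0]); // prints 3 — values 0, 1, 65535 share bucket 0
    }
}
```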
Now we go hunting for the median. We walk the buckets in order, keeping a running sum. The moment that sum first reaches N/2, the bucket we just stepped into is the one that contains the median.
```java
long target = N / 2;
long cumulative = 0;
int medianBucket = -1;
long countBefore = 0;

for (int i = 0; i < 65536; i++) {
    if (cumulative + buckets[i] >= target) {
        medianBucket = i;
        countBefore = cumulative; // values strictly before this bucket
        break;
    }
    cumulative += buckets[i];
}
```
Two things come out of this loop, and we'll use both in Pass 2:
- medianBucket — the index of the bucket containing the median value. We now know the median's coarse location: the high 16 bits of its value are pinned down.
- countBefore — how many of the 10 billion numbers landed in earlier buckets. The median is the (target − countBefore)-th smallest within medianBucket.
A small detail that trips people up. With signed 32-bit ints, the bit pattern 0x80000000 is the most-negative number, and 0x7FFFFFFF is the most-positive. If you bucket by raw high-16-bits, the negatives land at the top of the array, not the bottom — which would silently break the "walk in order" loop above.
The fix is a one-liner at read time: add 2³¹ to every value (or equivalently, XOR the sign bit). That maps the signed range [−2³¹, 2³¹−1] to the unsigned range [0, 2³²−1], preserving order. Now bucket index 0 holds the most negative numbers, bucket 65,535 holds the most positive, and the cumulative walk works correctly. We undo the offset only at the very end when we report the answer.
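The mapping is worth seeing in code. A tiny sketch (helper names `toUnsignedKey` and `bucketOf` are mine, not a library's):

```java
class SignedOrder {
    // XOR the sign bit: maps signed int order onto unsigned int order.
    // Equivalent to adding 2^31 in unsigned arithmetic.
    static long toUnsignedKey(int v) {
        return (v ^ 0x80000000) & 0xFFFFFFFFL; // widen so the key compares as unsigned
    }

    static int bucketOf(int v) {
        return (int) (toUnsignedKey(v) >>> 16); // most negative value -> bucket 0
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(Integer.MIN_VALUE)); // prints 0
        System.out.println(bucketOf(-1));                // prints 32767
        System.out.println(bucketOf(0));                 // prints 32768
        System.out.println(bucketOf(Integer.MAX_VALUE)); // prints 65535
    }
}
```

Note how the negatives now fill buckets 0 through 32767 and the non-negatives fill 32768 through 65535 — exactly the order the cumulative walk expects.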
Pass 1 narrowed the median to one of 65,536 brackets. Each bracket is 2¹⁶ = 65,536 values wide. We now do the same trick again — but this time we only care about values that land in medianBucket, and we sub-bucket them by their low 16 bits.
It's the same algorithm, scoped down. We allocate another 65,536-entry counter array (sub[]). We re-stream all 50 files. For each integer, we filter — if its high 16 bits don't match medianBucket, we ignore it. Otherwise, we extract its low 16 bits and increment sub[low16(v)].
```java
long[] sub = new long[65536];
for (int v : streamAllFiles()) {
    int hi = (v >>> 16) & 0xFFFF;
    if (hi == medianBucket) {
        int lo = v & 0xFFFF;
        sub[lo]++;
    }
}
```
After Pass 2, sub[] tells us how many copies of every individual value in medianBucket exist in the dataset. We do one last cumulative walk:
```java
long needRank = target - countBefore; // rank within medianBucket
long cum = 0;
int medianLow = -1;

for (int i = 0; i < 65536; i++) {
    cum += sub[i];
    if (cum >= needRank) {
        medianLow = i;
        break;
    }
}

int median = (medianBucket << 16) | medianLow; // stitch back together
```
Two 16-bit halves combine to give the exact 32-bit median value. Done — in two sequential reads of the data, with under 1 MB of RAM total.
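For the skeptical, here is the whole algorithm stitched into one self-contained method — a sketch, with an in-memory `int[]` standing in for the two streaming reads, and the sign-bit XOR applied at read time and undone at the end:

```java
class TwoPassMedian {
    // Lower median (rank ceil(N/2)) of signed 32-bit ints, two counting passes.
    // `data` is an in-memory stand-in for two sequential reads of the files.
    static int median(int[] data) {
        long target = (data.length + 1) / 2; // 1-based rank of the lower median

        // Pass 1: bucket by high 16 bits of the order-preserving unsigned key.
        long[] buckets = new long[65536];
        for (int v : data) buckets[(v ^ 0x80000000) >>> 16]++;

        long cumulative = 0, countBefore = 0;
        int medianBucket = 0;
        for (int i = 0; i < 65536; i++) {
            if (cumulative + buckets[i] >= target) {
                medianBucket = i;
                countBefore = cumulative;
                break;
            }
            cumulative += buckets[i];
        }

        // Pass 2: re-stream, keep only the median bucket, count low 16 bits.
        long[] sub = new long[65536];
        for (int v : data) {
            int key = v ^ 0x80000000;
            if ((key >>> 16) == medianBucket) sub[key & 0xFFFF]++;
        }

        long need = target - countBefore, cum = 0;
        int medianLow = 0;
        for (int i = 0; i < 65536; i++) {
            cum += sub[i];
            if (cum >= need) { medianLow = i; break; }
        }
        // Stitch the halves, then undo the sign-bit offset.
        return ((medianBucket << 16) | medianLow) ^ 0x80000000;
    }

    public static void main(String[] args) {
        System.out.println(median(new int[]{3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5})); // prints 4
    }
}
```

The real program differs only in where the ints come from — 50 file streams instead of one array — and in splitting the Pass 1 loop across threads.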
Algorithms feel abstract until you watch them eat real numbers. Let's shrink the problem so the whole thing fits on a screen.
Pretend our integers are 4 bits wide (range [0, 15]) and we have N = 16 of them across two files. We'll use 4 buckets in each pass instead of 65,536, but the math is identical.
- File A: 3, 1, 4, 1, 5, 9, 2, 6
- File B: 5, 3, 5, 8, 9, 7, 9, 3
- N = 16 → target rank = N/2 = 8 (the 8th smallest)
- Sorted, for checking only: 1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 8, 9, 9, 9 → answer should be 5

With 4 buckets, each covers 4 values. The "bucket index" of a number is just v >> 2 (the high 2 bits of a 4-bit number).
| Bucket | Range | Values that land here | Count | Cumulative |
|---|---|---|---|---|
| 0 | 0–3 | 1, 1, 2, 3, 3, 3 | 6 | 6 |
| 1 | 4–7 | 4, 5, 5, 5, 6, 7 | 6 | 12 ← crosses 8 here |
| 2 | 8–11 | 8, 9, 9, 9 | 4 | 16 |
| 3 | 12–15 | — | 0 | 16 |
Cumulative count first reaches the target (8) at bucket 1. So:
- medianBucket = 1
- countBefore = 6 (elements in earlier buckets)
- target − countBefore = 8 − 6 = 2 → the median is the 2nd smallest within bucket 1

Re-stream both files. Filter to values in bucket 1 (range 4–7). Increment sub[v & 0b11] for each.
| Sub-bucket (low 2 bits) | Actual value | Count | Cumulative |
|---|---|---|---|
| 0 | 4 (bucket_base + 0) | 1 | 1 |
| 1 | 5 | 3 | 4 ← crosses 2 here |
| 2 | 6 | 1 | 5 |
| 3 | 7 | 1 | 6 |
Cumulative first reaches needRank = 2 at sub-bucket 1. The actual value is (medianBucket << 2) | 1 = (1 << 2) | 1 = 5. ✓
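The walkthrough is small enough to verify mechanically. Here's a sketch mirroring it, with 2-bit high/low splits instead of 16-bit ones (the helper name `median4bit` is mine):

```java
class ToyExample {
    // 4-bit values, 4 buckets of width 4: bucket = v >> 2, sub-bucket = v & 0b11.
    static int median4bit(int[] data, long target) {
        long[] buckets = new long[4];
        for (int v : data) buckets[v >> 2]++;

        long cumulative = 0, countBefore = 0;
        int medianBucket = 0;
        for (int i = 0; i < 4; i++) {
            if (cumulative + buckets[i] >= target) {
                medianBucket = i;
                countBefore = cumulative;
                break;
            }
            cumulative += buckets[i];
        }

        long[] sub = new long[4];
        for (int v : data) if (v >> 2 == medianBucket) sub[v & 0b11]++;

        long need = target - countBefore, cum = 0;
        int low = 0;
        for (int i = 0; i < 4; i++) {
            cum += sub[i];
            if (cum >= need) { low = i; break; }
        }
        return (medianBucket << 2) | low; // stitch high and low bits
    }

    public static void main(String[] args) {
        int[] all = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3}; // both toy files
        System.out.println(median4bit(all, 8)); // prints 5, matching the walkthrough
    }
}
```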
Now that the shape of the algorithm is clear, let's price it out for the real 40 GB problem.
| Resource | External merge sort | Bucket counting (this article) |
|---|---|---|
| Peak RAM | Tunable, typically 1–4 GB for buffers | ~1 MB (two 65k-entry long arrays) |
| Disk reads | ~100 GB (chunk-read + merge-read + median walk) | ~80 GB (two streaming passes) |
| Disk writes | ~80 GB (sorted runs + final sorted file) | 0 — no intermediate output |
| CPU per element | O(log N) inside heap-merge | O(1) — shift, mask, array increment |
| Time on a 500 MB/s SSD | ~7 minutes (I/O-dominated) | ~3 minutes |
| Parallel-friendly? | Sort phase yes; merge phase serial | Yes — both passes embarrassingly parallel |
At 40 GB, every second of wall-clock time is dictated by how many bytes you move. A modern NVMe SSD reads at 3–5 GB/s sequential; a SATA SSD at 500 MB/s; a network-attached blob store at maybe 200 MB/s. The bucket method's win isn't that it's smarter — it's that it does less I/O and the I/O it does is purely sequential, which is the fastest mode on every storage medium.
Three rules of thumb turn this from "an algorithm" into "a program that finishes":

- Read sequentially, in big chunks. mmap lets the kernel page in file pages on demand and skip a userspace copy; for sequential scans of bounded-size files it's often the single biggest win.
- Keep the per-element work minimal: one shift, one mask, one array increment — nothing that allocates.
- Parallelize across files, one private counter array per thread, summed element-wise at the end.

The two-pass story is the headline. The rest of this section is the small print that separates a good answer from a great one.
If N is even, the median is conventionally defined as the average of the two middle values (rank N/2 and N/2+1). Track both ranks during the cumulative walk. Often they fall in the same final sub-bucket — answer is that value. If they fall in different sub-buckets, it's the average of the two values.
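One way to sketch this: make the cumulative walk generic over a rank, and the even case becomes two calls. The histogram-based helper below is illustrative (names are mine), not lifted from the article's pipeline:

```java
class EvenMedian {
    // Given a histogram where hist[v] counts occurrences of value v,
    // return the value at 1-based rank r via a cumulative walk.
    static int valueAtRank(long[] hist, long r) {
        long cum = 0;
        for (int v = 0; v < hist.length; v++) {
            cum += hist[v];
            if (cum >= r) return v;
        }
        throw new IllegalArgumentException("rank beyond total count");
    }

    // Even N: the conventional median averages the ranks N/2 and N/2 + 1.
    static double evenMedian(long[] hist, long n) {
        return (valueAtRank(hist, n / 2) + valueAtRank(hist, n / 2 + 1)) / 2.0;
    }

    public static void main(String[] args) {
        long[] hist = new long[16];
        for (int v : new int[]{1, 2, 4, 7}) hist[v]++;
        System.out.println(evenMedian(hist, 4)); // prints 3.0 — average of 2 and 4
    }
}
```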
You can't bucket the full 64-bit range in one shot — far too many buckets. Solution: do four passes of 16 bits each (high → mid-high → mid-low → low), or three passes of 22-bit slices (4M counters ≈ 32 MB — still tiny). Same algorithm, just extra zoom levels; with 16-bit slices, RAM stays at 512 KB per pass.
What if Pass 1 dumps 6 billion of the 10 billion values into a single bucket (e.g., almost all zeros)? Pass 2 still works — the sub-bucket array doesn't grow with input size, only with bucket width. And if a sub-bucket at some level is still huge, you simply recurse: 16 bits → 8 bits → 8 bits. The algorithm is self-correcting — you never run out of zoom levels until you're down to naming a single value.
Mentioned in §5: signed integers don't bucket in natural order without a fix. Add 2³¹ at read time to map [−2³¹, 2³¹−1] → [0, 2³²−1], preserving order. Subtract it back at the end. One arithmetic op per element — basically free.
Duplicates? Trivially handled. The bucket counter just increments more often. Counting doesn't care about distinctness.
Pass 1 has zero cross-thread state. Each thread keeps its own 512 KB bucket array; at the end of the pass, sum them element-wise. With 50 files and 16 cores, you're disk-bound long before you're CPU-bound — the algorithm scales with how fast you can pull bytes off storage.
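A minimal sketch of that scheme, with `int[]` chunks standing in for the files and simplified thread handling (`countAll` is a name of my choosing):

```java
import java.util.List;

class ParallelPass1 {
    // Each "file" (here: an int[] chunk) gets its own thread and its own
    // private bucket array; the per-thread arrays are summed at the end.
    static long[] countAll(List<int[]> files) {
        long[][] perThread = new long[files.size()][65536];
        Thread[] threads = new Thread[files.size()];
        for (int t = 0; t < files.size(); t++) {
            final int id = t;
            threads[t] = new Thread(() -> {
                for (int v : files.get(id)) perThread[id][v >>> 16]++; // no shared state
            });
            threads[t].start();
        }
        long[] total = new long[65536];
        for (int t = 0; t < threads.length; t++) {
            try {
                threads[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            for (int i = 0; i < 65536; i++) total[i] += perThread[t][i]; // element-wise sum
        }
        return total;
    }

    public static void main(String[] args) {
        long[] b = countAll(List.of(new int[]{1, 2, 70000}, new int[]{3, 65536}));
        System.out.println(b[0] + " " + b[1]); // prints "3 2"
    }
}
```

The `join` before the sum is what makes the per-thread counts safely visible to the summing thread.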
Replace target = N/2 with target = N * p for the p-th percentile. Same two passes. Useful when you need P95, P99, etc. on huge logs.

The bucket method has one hard prerequisite: the values must come from a bounded, discretizable range that you can carve into buckets ahead of time. When that breaks, you need a different tool. Three real-world cases:
Floats can be reinterpreted as integers (the IEEE 754 bit pattern is monotonic for non-negative floats), so you can apply the same trick — but the math is fiddly: handle NaN, ±0, denormals, and flip bits for negatives. Doable, but most teams give up exact answers and use approximation.
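For the curious, the standard bit trick looks like this — a sketch that deliberately ignores NaN, which a production version must filter out first:

```java
class FloatKey {
    // Map a float's IEEE 754 bit pattern to an int whose *unsigned* order
    // matches float order: flip the sign bit for non-negative patterns,
    // flip all bits for negative ones. NaN is left out of this sketch.
    static int sortableKey(float f) {
        int bits = Float.floatToRawIntBits(f);
        return bits >= 0 ? bits ^ 0x80000000 : ~bits;
    }

    public static void main(String[] args) {
        float[] ordered = {Float.NEGATIVE_INFINITY, -1.5f, -0.0f, 0.0f, 2.5f, Float.POSITIVE_INFINITY};
        for (int i = 1; i < ordered.length; i++) {
            // unsigned comparison of keys must agree with float order
            System.out.println(Integer.compareUnsigned(
                sortableKey(ordered[i - 1]), sortableKey(ordered[i])) <= 0); // prints true each time
        }
    }
}
```

With that mapping in place, the two-pass bucket method runs on the keys unchanged.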
If values can be arbitrarily large, you can't pre-allocate a bucket grid. Fall back to quickselect-on-disk with sampled pivots, or external merge sort, or sketches.
Sometimes you can't make two passes — the data is a live stream. You give up exact median for an approximate one. t-digest, P², and HDR Histogram are the standard tools. Each gives you any percentile, with bounded error, in fixed memory, in one pass.
If your business doesn't actually need the exact median to the integer — just "around 5,000-ish, ±1%" — random reservoir sampling is the simplest correct answer. Sample 10 million numbers uniformly, sort the sample in RAM, take the middle. Ten lines of code, near-perfect accuracy for typical distributions.
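A sketch of that ten-ish-line answer, using plain Algorithm R reservoir sampling (`approxMedian` and its parameters are my naming, not from the article):

```java
import java.util.Arrays;
import java.util.Random;

class ReservoirMedian {
    // Algorithm R: keep a uniform random sample of k elements from a stream
    // of unknown length, then take the middle of the sorted sample.
    static int approxMedian(int[] stream, int k, long seed) {
        Random rng = new Random(seed);
        int[] sample = new int[Math.min(k, stream.length)];
        for (int i = 0; i < stream.length; i++) {
            if (i < sample.length) {
                sample[i] = stream[i];          // fill the reservoir first
            } else {
                int j = rng.nextInt(i + 1);     // keep element i with probability k/(i+1)
                if (j < sample.length) sample[j] = stream[i];
            }
        }
        Arrays.sort(sample);
        return sample[sample.length / 2];
    }

    public static void main(String[] args) {
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i; // true median ~ 50_000
        System.out.println(approxMedian(data, 10_001, 42)); // close to 50_000, not exact
    }
}
```

When k is at least the stream length, the "sample" is the whole dataset and the answer is exact — a convenient property for testing.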
If someone stops you in the hallway and asks you to explain the whole article in a minute, here's the script.
The problem. 10 billion 32-bit integers across 50 files, 40 GB of data, 16 GB RAM. Find the median. Sorting is wasteful — we don't need the order, just the value at rank N/2.
The trick. A 32-bit integer's range is bounded. Carve it into 65,536 brackets (one per high-16-bits value) and just count how many numbers land in each bracket. Counting is O(N) and needs ~512 KB of RAM — trivial.
Pass 1. Stream all 50 files once. For each int, increment buckets[high16(v)]. Walk the bucket array cumulatively — the bucket where the running sum first crosses N/2 contains the median.
Pass 2. Re-stream the files. Only count values whose high 16 bits match the median bucket. Sub-bucket them by their low 16 bits. Walk the sub-array cumulatively. The sub-bucket where the count crosses (N/2 − countBefore) is the exact median value. Stitch the high and low halves back together — done.
The cost. Two streaming reads of the data (~80 GB I/O), under 1 MB of RAM, no intermediate writes, embarrassingly parallel across files.
The general idea. When you only need a positional answer (median, percentile, k-th smallest) over a large dataset with a bounded value range, replace sorting with counting. Counting fits in cache. Sorting fits in pain.