Why does memory bandwidth matter more than FLOPS?
FLOPS measure how fast a chip can do math. Many AI workloads spend most of their time waiting for weights, cache, and activations to arrive.
For inference, the binding constraint often shifts from compute to memory bandwidth. The chip with the biggest FLOPS number is not automatically the chip with the best token economics.
The model has to be read before it can answer
Every generated token requires the accelerator to touch model weights, activations, and often a growing key-value cache. The math is fast once the bytes are in the right place. The wait comes from moving those bytes through the memory hierarchy.
This is the memory wall. It shows up whenever the chip has enough arithmetic capacity but cannot feed it quickly enough.
Prefill and decode stress different machines
Prefill processes the whole prompt in parallel. It produces large matrix multiplies that keep tensor cores busy. Decode generates one token at a time. It produces skinny matrix-vector operations whose cost is dominated by memory reads.
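One way to see the gap is arithmetic intensity, the FLOPs performed per byte moved. The sketch below is a simplified model, assuming a single FP16 weight matrix of hypothetical size 8192 x 8192 and counting only weight and activation traffic. The function name and sizes are illustrative, not taken from any specific model.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for multiplying a (tokens x d_in) activation by a (d_in x d_out) weight."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per weight per token
    bytes_moved = bytes_per_elem * (
        d_in * d_out        # read the weight matrix once
        + tokens * d_in     # read the input activations
        + tokens * d_out    # write the output activations
    )
    return flops / bytes_moved


D = 8192  # hypothetical hidden size
prefill_intensity = arithmetic_intensity(tokens=2048, d_in=D, d_out=D)  # whole prompt at once
decode_intensity = arithmetic_intensity(tokens=1, d_in=D, d_out=D)      # one new token

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte, decode: {decode_intensity:.2f} FLOPs/byte")
```

Under these assumptions the decode step does about one FLOP per byte read, while the 2048-token prefill does over a thousand. The same weights are read either way; prefill simply reuses each byte across many tokens.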
That is why a long-context chat can feel expensive even when the model is not doing much new reasoning. The system is repeatedly reading state to produce one more token.
HBM turns into the price of a token
High-bandwidth memory is scarce, expensive, and physically close to the compute die. In inference, its bandwidth can set tokens per second and therefore cost per useful answer.
A useful rule: if the workload is decode-heavy, read the HBM bandwidth line before the peak FLOPS line. If the workload is prefill-heavy, FLOPS start to matter again.
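The rule can be made quantitative with a roofline-style check, a framing assumed here rather than named in the text. A chip's ridge point is its peak FLOPS divided by its memory bandwidth; a kernel that does fewer FLOPs per byte than that is limited by bandwidth. In the sketch below the peak FLOPS value is a hypothetical round number, not a vendor specification, and the two intensity values are illustrative stand-ins for a decode step and a prefill step.

```python
def ridge_point(peak_flops: float, hbm_bytes_per_s: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound rather than memory-bound."""
    return peak_flops / hbm_bytes_per_s


def binding_constraint(intensity: float, peak_flops: float, hbm_bytes_per_s: float) -> str:
    """Name the spec line that limits a kernel with the given arithmetic intensity."""
    if intensity < ridge_point(peak_flops, hbm_bytes_per_s):
        return "HBM bandwidth"
    return "peak FLOPS"


PEAK_FLOPS = 2e15  # hypothetical: 2 PFLOP/s of dense low-precision math
HBM_BW = 8e12      # 8 TB/s of HBM bandwidth

ridge = ridge_point(PEAK_FLOPS, HBM_BW)
decode_limit = binding_constraint(1.0, PEAK_FLOPS, HBM_BW)      # ~1 FLOP/byte, decode-like
prefill_limit = binding_constraint(1000.0, PEAK_FLOPS, HBM_BW)  # ~1000 FLOPs/byte, prefill-like

print(f"ridge: {ridge:.0f} FLOPs/byte, decode: {decode_limit}, prefill: {prefill_limit}")
```

With these illustrative numbers the ridge sits at 250 FLOPs per byte, so the decode-like kernel is bound by HBM bandwidth and the prefill-like kernel by peak FLOPS.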
The two-lane race that sets the floor
A clean way to picture every decode step is a two-lane race. One lane is doing the math. The other lane is moving the bytes the math needs. The step is finished only when the slower lane crosses the line. The total time is the maximum of the two, not the sum.
For one decoded token across a batch of users, the math lane runs in proportion to the number of users times the active parameters. The memory lane runs in proportion to the size of the model plus a small per-user cache. The memory floor is the time it takes to read every parameter from HBM exactly once. On a B200 at 8 TB/s, a 70 B FP16 model (~140 GB) takes roughly 18 ms; a smaller 7 B FP16 model (~14 GB) takes about 2 ms. The floor scales linearly with model size divided by HBM bandwidth. It does not shrink with batch size: even a batch of one user hits the same per-token floor. You cannot pay your way under it.
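The floor arithmetic can be checked directly. This is a minimal sketch, assuming 2 bytes per FP16 parameter and ignoring the per-user cache and any overhead; the function name is illustrative.

```python
def memory_floor_ms(model_bytes: float, hbm_bytes_per_s: float) -> float:
    """Time to stream every weight byte from HBM once, in milliseconds."""
    return model_bytes / hbm_bytes_per_s * 1e3


HBM_BW = 8e12             # 8 TB/s, the B200 figure quoted in the text
BYTES_PER_PARAM_FP16 = 2  # FP16 stores each parameter in 2 bytes

floor_70b = memory_floor_ms(70e9 * BYTES_PER_PARAM_FP16, HBM_BW)  # ~140 GB of weights
floor_7b = memory_floor_ms(7e9 * BYTES_PER_PARAM_FP16, HBM_BW)    # ~14 GB of weights

print(f"70B: {floor_70b:.2f} ms, 7B: {floor_7b:.2f} ms")
```

The results, 17.5 ms and 1.75 ms, are the unrounded versions of the "roughly 18 ms" and "about 2 ms" figures above.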
This is the answer to the question that started the chapter. Faster math without faster memory does not make decode faster. The slower lane is already setting the clock. New chip generations that double FLOPS without doubling bandwidth move the wrong lane.
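The two-lane picture can be sketched as a toy model to show which upgrade moves the clock. Everything here is an assumption for illustration: the common approximation of 2 FLOPs per active parameter per token, a hypothetical 2 PFLOP/s peak, a hypothetical 1 GB cache per user, and perfect overlap between the lanes.

```python
def decode_step_ms(batch, active_params, weight_bytes, cache_bytes_per_user,
                   peak_flops, hbm_bytes_per_s):
    """Step time for one decoded token per user: the slower of the two lanes."""
    # Math lane: ~2 FLOPs (multiply + add) per active parameter per user.
    math_ms = 2 * batch * active_params / peak_flops * 1e3
    # Memory lane: read the weights once, plus each user's cache.
    memory_ms = (weight_bytes + batch * cache_bytes_per_user) / hbm_bytes_per_s * 1e3
    return max(math_ms, memory_ms)  # maximum of the lanes, not their sum


base = dict(
    batch=8,
    active_params=70e9,
    weight_bytes=140e9,        # 70 B parameters in FP16
    cache_bytes_per_user=1e9,  # hypothetical per-user cache size
    peak_flops=2e15,           # hypothetical peak, not a vendor spec
    hbm_bytes_per_s=8e12,      # 8 TB/s
)

baseline = decode_step_ms(**base)
faster_math = decode_step_ms(**{**base, "peak_flops": 2 * base["peak_flops"]})
faster_memory = decode_step_ms(**{**base, "hbm_bytes_per_s": 2 * base["hbm_bytes_per_s"]})

print(f"baseline {baseline:.2f} ms, 2x FLOPS {faster_math:.2f} ms, 2x bandwidth {faster_memory:.2f} ms")
```

In this toy setup the math lane finishes in well under a millisecond while the memory lane takes 18.5 ms, so doubling FLOPS leaves the step time unchanged and doubling bandwidth halves it.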
Source: Reiner Pope, blackboard inference economics on Dwarkesh Podcast, 2025
Software attacks the memory pattern
FlashAttention, paged attention, grouped-query attention, speculative decoding, quantization, and cache compression all share one motive: move fewer bytes, move them more predictably, or reuse them before they leave fast memory.
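Two of these techniques, grouped-query attention and cache quantization, reduce bytes in a way that is easy to count. The sketch below estimates key-value cache size per token; the layer count, head counts, and head dimension are hypothetical values chosen for illustration, and the function name is not from any library.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes of key-value cache added per token of context."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128  # hypothetical model shape

# Standard multi-head attention: one KV head per query head, FP16 cache.
mha = kv_cache_bytes_per_token(LAYERS, kv_heads=QUERY_HEADS, head_dim=HEAD_DIM)
# Grouped-query attention: 8 KV heads shared across the 64 query heads.
gqa = kv_cache_bytes_per_token(LAYERS, kv_heads=8, head_dim=HEAD_DIM)
# Same grouping with the cache stored at 1 byte per element.
gqa_8bit = kv_cache_bytes_per_token(LAYERS, kv_heads=8, head_dim=HEAD_DIM, bytes_per_elem=1)

print(mha, gqa, gqa_8bit)
```

For this shape the per-token cache falls from about 2.6 MB to about 0.33 MB with grouping, and halves again at 8 bits. Every decode step reads the cache, so the saving repeats on every generated token.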
The algorithmic win and the chip win are linked. Better kernels make the same HBM go further. Better HBM lets the same model serve more tokens before the rack hits its economic limit.
The bottleneck rotates back into models
Once memory bandwidth becomes the constraint, model architecture changes. Smaller active parameter counts, mixture-of-experts routing, lower precision, and retrieval patterns are not just research ideas. They are ways to fit intelligence into the memory budget.
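A rough way to see the memory budget at work is to compute the weight-read time for different designs. This is a sketch for a single user only, assuming the 8 TB/s bandwidth figure used earlier and a hypothetical mixture-of-experts model with 12 B active parameters; with larger batches, different users route to different experts, so more of the model is read per step than this estimate suggests.

```python
def weight_read_floor_ms(active_params: float, bits_per_param: int,
                         hbm_bytes_per_s: float = 8e12) -> float:
    """Time to read the parameters used for one token once from HBM, in ms."""
    bytes_read = active_params * bits_per_param / 8
    return bytes_read / hbm_bytes_per_s * 1e3


dense_fp16 = weight_read_floor_ms(70e9, 16)  # dense model, every parameter active
moe_fp16 = weight_read_floor_ms(12e9, 16)    # hypothetical MoE, 12 B active parameters
moe_8bit = weight_read_floor_ms(12e9, 8)     # same MoE at 8 bits per parameter

print(f"dense FP16 {dense_fp16:.2f} ms, MoE FP16 {moe_fp16:.2f} ms, MoE 8-bit {moe_8bit:.2f} ms")
```

Under these assumptions the floor drops from 17.5 ms to 3 ms by activating fewer parameters, and to 1.5 ms by also halving the precision, without any change to the chip.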
The memory wall is where chips stop being a hardware topic and become a model-design topic.