The Inference Shift: Prefill, Decode, and the End of Waiting

The AI chip race to date has been remarkably simple: build the fastest compute paired with the fastest memory, and sell it to labs training frontier models or companies serving chat interfaces to humans. The constraint was always time. A human waiting for a cursor to move will abandon the product if it takes more than a few hundred milliseconds.

This has led to a hardware architecture optimized for latency. We paired massive GPUs with expensive High-Bandwidth Memory (HBM). But as AI shifts from chatbots answering questions to agents executing long-horizon tasks autonomously, the economic and structural logic of inference flips.

When no human is watching, latency stops mattering. Capacity becomes the binding constraint. This shift fundamentally changes how we think about the mechanics of Large Language Models (LLMs) and the hardware that runs them.

1. The Mechanics: Prefill vs. Decode

To understand why the infrastructure is changing, we must first abstract the mechanics of how an LLM actually generates a response. The process is split into two distinct phases: Prefill and Decode.

Prefill: The Parallel Compute Phase

When you send a prompt, the system prompt, and all historical context to the model, it must turn that entire block of text into numbers and understand the relationships between them. This is the prefill stage. It involves massive, parallel matrix multiplication. The model walks the prompt through every layer of weights once, in parallel across all input tokens, and writes the intermediate keys and values into what is called the KV Cache (Key-Value cache), a memory bank of what it just read.

Constraint: Prefill is compute-bound. It requires massive computational horsepower to do the math quickly, and in return it keeps the GPU's processing cores fully utilized.

Decode: The Serial Memory Phase

Once the prompt is digested, the model begins generating the answer one token at a time (a token is roughly a word or a piece of a word). To generate the next token, the model must read two things: the model weights (the "brain") and the growing KV cache (the "memory" of the conversation). It does this over and over, sequentially.

Constraint: Decode is memory-bandwidth bound. It is not limited by how fast the GPU can do math; it is limited by how fast it can fetch the KV cache and weights from memory. The compute cores sit idle most of the time, waiting for data to arrive.
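The compute-bound versus memory-bound distinction can be made concrete with a rough arithmetic-intensity estimate: FLOPs performed per byte moved, compared against the hardware's ratio of peak compute to memory bandwidth. The sketch below models a single square weight matrix in 16-bit precision. The function names, the 4096-wide layer, and the round hardware figures (1e15 FLOP/s, 3e12 bytes/s) are illustrative assumptions, not measurements of any specific GPU.

```python
def arithmetic_intensity(n_tokens: int, d_model: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for one (n_tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * n_tokens * d_model * d_model                    # multiply + add per weight per token
    weight_bytes = bytes_per_elem * d_model * d_model           # weights are read once
    activation_bytes = bytes_per_elem * 2 * n_tokens * d_model  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


def regime(n_tokens: int, d_model: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify the matmul by comparing its intensity to the machine balance."""
    machine_balance = peak_flops / mem_bandwidth  # FLOPs the chip can do per byte it can fetch
    ai = arithmetic_intensity(n_tokens, d_model)
    return "compute-bound" if ai >= machine_balance else "memory-bound"


# Assumed, rounded hardware figures for illustration only.
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

print("prefill, 4096 tokens:", regime(4096, 4096, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, 1 token:     ", regime(1, 4096, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, the 4096-token prefill performs over a thousand FLOPs per byte, while single-token decode performs about one, far below the machine balance of roughly 333. That gap is why the cores wait on memory during decode.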

Today, we hide this inefficiency by putting everything in HBM—a specialized memory physically stacked next to the GPU that is incredibly fast (terabytes per second) but astronomically expensive and strictly limited in capacity (e.g., 80GB on an H100).
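Because every decode step has to stream the weights and the KV cache out of memory, memory bandwidth sets a floor on time per token. A minimal sketch of that floor follows; the 14 GB weight footprint, the 1 GB per-sequence KV cache, and the 3e12 bytes/s bandwidth are hypothetical round numbers chosen for illustration, and the model ignores compute time and any overlap between transfers.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                        batch_size: int, mem_bandwidth: float) -> float:
    """Lower bound on one decode step: bytes that must be read / bandwidth.

    Weights are read once per step and shared by the whole batch;
    each sequence in the batch reads its own KV cache.
    """
    total_bytes = weight_bytes + batch_size * kv_bytes_per_seq
    return total_bytes / mem_bandwidth


def tokens_per_second(weight_bytes: float, kv_bytes_per_seq: float,
                      batch_size: int, mem_bandwidth: float) -> float:
    """Each step emits one token per sequence in the batch."""
    step = decode_step_seconds(weight_bytes, kv_bytes_per_seq, batch_size, mem_bandwidth)
    return batch_size / step


# Hypothetical inputs, for illustration only.
WEIGHTS = 14e9     # bytes
KV_PER_SEQ = 1e9   # bytes
BANDWIDTH = 3e12   # bytes/s

for batch in (1, 8):
    ms = decode_step_seconds(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH) * 1e3
    tps = tokens_per_second(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH)
    print(f"batch={batch}: at least {ms:.2f} ms per step, at most {tps:.0f} tokens/s")
```

The sketch also shows why batching helps only up to a point: the weight read is amortized across the batch, but the KV cache read grows with every sequence added.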

2. The Split: Answer Inference vs. Agentic Inference

This dynamic forces a fork in the road for how we deploy models.

Answer Inference is what we do today. A human asks a coding copilot a question and waits for the code. Speed is the product. We buy a frontier GPU and pair it with expensive HBM. During prefill, we waste memory bandwidth. During decode, we strand compute. Public measurements of decode utilization put tensor cores in the 20-40% range, and at small batch sizes effective tensor compute utilization can drop below 5% — most of the silicon waits on memory. We accept this waste because the human will pay a premium for low latency.

Metaphor: We are building Ferraris. They are incredibly expensive, they spend a lot of time idling at red lights, but when the light turns green, they get the human to their destination instantly.

Agentic Inference is the future. An AI agent is given a goal—"refactor this codebase," "research these 50 companies," or "run these simulations"—and it works overnight. There is no human watching. Speed per token no longer matters.

Agents get better by remembering more: keeping entire company wikis, tool logs, and massive conversational histories in their context window. This means the KV cache explodes. For a large model, the KV cache of a million-token context will not fit in the HBM of a single GPU. If we insisted on using HBM for agents, we would need thousands of dollars of memory per agent.
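The size of the KV cache follows from simple arithmetic: each token stores one key vector and one value vector per layer. The sketch below applies that formula to a hypothetical large model with 80 layers, 8 key-value heads of dimension 128 (a grouped-query attention layout), and 16-bit values. These shapes are assumptions chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens


# Hypothetical model shapes, for illustration only.
size = kv_cache_bytes(n_tokens=1_000_000, n_layers=80, n_kv_heads=8, head_dim=128)
hbm_capacity = 80 * 10**9  # the 80GB figure cited for an H100

print(f"KV cache for 1M tokens: {size / 1e9:.2f} GB")
print(f"Fits in one 80GB GPU:   {size <= hbm_capacity}")
```

With these assumed shapes the cache comes to about 328 GB for a single million-token context, before counting the model weights, which is several times the capacity of one 80GB GPU.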

Metaphor: The agent era wants freight trains. They move slowly, but they have giant boxcars of cheap capacity, and they run overnight.

3. The Memory Tiering Shift

If latency isn't the constraint, you can trade speed for capacity. This leads to a fundamental redesign of the AI hardware stack around memory tiering.

Instead of keeping everything in ultra-fast, ultra-expensive HBM, systems will distribute the memory based on how quickly it is needed:

The Hot Tier (HBM): Only the current layer's weights and the few thousand tokens the model is actively attending to. (Fast, expensive.)
The Warm Tier (DRAM): The rest of the recent KV cache. (Roughly 30x slower than HBM, but about 10x cheaper per GB.)
The Cold Tier (NVMe SSD / Flash): Older conversation history, tool outputs, and embeddings. (Roughly 500x slower than HBM, but about 100x cheaper per GB.)
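The three tiers can be sketched as a simple placement policy keyed on how recently a token was seen. The relative latency and cost figures below are the ones quoted in the list above; the window sizes (4,096 tokens hot, 131,072 tokens warm) and all names are hypothetical choices for illustration. A real system would likely place data by access pattern rather than by age alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    relative_latency: float      # multiple of HBM access time
    relative_cost_per_gb: float  # fraction of HBM cost per GB


HBM = Tier("HBM", 1, 1.0)
DRAM = Tier("DRAM", 30, 0.1)
SSD = Tier("SSD", 500, 0.01)

HOT_WINDOW = 4_096      # assumed: most recent tokens stay in HBM
WARM_WINDOW = 131_072   # assumed: tokens newer than this stay in DRAM


def place(age_in_tokens: int) -> Tier:
    """Pick a tier for a cached token based on how many tokens ago it was seen."""
    if age_in_tokens < HOT_WINDOW:
        return HBM
    if age_in_tokens < WARM_WINDOW:
        return DRAM
    return SSD


def relative_memory_cost(context_tokens: int) -> float:
    """Cost of the tiered KV cache as a fraction of keeping it all in HBM."""
    hot = min(context_tokens, HOT_WINDOW)
    warm = max(0, min(context_tokens, WARM_WINDOW) - HOT_WINDOW)
    cold = max(0, context_tokens - WARM_WINDOW)
    tiered = (hot * HBM.relative_cost_per_gb
              + warm * DRAM.relative_cost_per_gb
              + cold * SSD.relative_cost_per_gb)
    return tiered / context_tokens


print(place(100).name, place(50_000).name, place(900_000).name)
print(f"1M-token context costs {relative_memory_cost(1_000_000):.1%} of all-HBM")
```

Under these assumed windows, a million-token context costs a few percent of what it would cost to hold entirely in HBM, at the price of much slower access to the older tokens.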

The Library Metaphor: Think of this like your desk versus a warehouse. Today, for Answer Inference, we keep the entire library stacked on your desk (HBM) so you can grab any page instantly. It is fast, but your desk is small and the rent is high.

For Agentic Inference, you keep the book you are currently reading on your desk (HBM). You keep the current chapter in a nearby cart (DRAM). You put the rest of the library in an off-site warehouse (SSD/Flash). When you need a page from the warehouse, it takes much longer to fetch it. But because no one is watching the clock, you can finish a massive research job for a fraction of the cost.

The Economic Upside

At planetary scale, moving more stuff slowly beats moving a little stuff instantly. If you have a fixed budget of $1 billion, you could buy a small fleet of frontier GPUs with HBM, offering ultra-low latency to a few agents at a time. Or, you could buy a massive fleet of older-generation GPUs (or even CPUs) paired with petabytes of cheap DRAM and SSDs.

Each token takes longer to generate because the system is constantly fetching data from slow storage. But because you can run millions of agents in parallel, your total throughput—your tokens per dollar—skyrockets.
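The tokens-per-dollar argument is a short piece of arithmetic: fleet throughput is the number of nodes the budget buys, times the agents each node can hold, times each agent's token rate. Every input in the sketch below is a made-up placeholder (node prices, agents per node, and per-agent speeds are not taken from any vendor or benchmark); only the $1 billion budget comes from the text. It is meant to show the shape of the trade, not to predict real numbers.

```python
def fleet_tokens_per_second(budget: int, node_cost: int,
                            agents_per_node: int, tokens_per_sec_per_agent: int) -> int:
    """Aggregate throughput of as many whole nodes as the budget can buy."""
    nodes = budget // node_cost
    return nodes * agents_per_node * tokens_per_sec_per_agent


BUDGET = 1_000_000_000  # the $1 billion from the text

# Placeholder configurations, invented for illustration only.
fast = fleet_tokens_per_second(BUDGET, node_cost=300_000,
                               agents_per_node=8, tokens_per_sec_per_agent=100)
slow = fleet_tokens_per_second(BUDGET, node_cost=30_000,
                               agents_per_node=64, tokens_per_sec_per_agent=5)

print(f"low-latency fleet:   {fast:,} tokens/s")
print(f"deep-memory fleet:   {slow:,} tokens/s")
print(f"throughput ratio:    {slow / fast:.1f}x")
```

In this invented scenario each agent on the cheap fleet runs 20 times slower, yet the fleet as a whole produces about four times as many tokens for the same budget, because capacity per dollar rises faster than speed per agent falls.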

The frontier GPU stack will remain dominant for training new models and for real-time consumer apps. But the largest market by far will be the agentic market, won by whoever masters the orchestration of cheap, deep memory hierarchies. The way we get more compute is by realizing the compute we have is already good enough; we just need more room to remember.

Sources

"Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn't Do Both." — Towards Data Science. Source for the 20-40% tensor utilization range during decode.
Rajan Sethi, "Why Your LLM Is Wasting 90% of Its GPU." — Source for the small-batch decode utilization figures.
Pierre Lienhart, "LLM Inference Series 5: Dissecting model performance." — Background reference on prefill/decode mechanics and the arithmetic intensity argument.
