Hello world.
One forward pass, from byte to silicon.
Most transformer explainers stop at the math. Most chip explainers stop at the datasheet. This is a single end-to-end walk of what actually happens between the moment you type a prompt and the moment the first new token arrives. Ten beats, each with the math and the silicon side by side.
The prompt arrives
You type Hello world into a chat box and hit enter. What reaches the model is not a string. It is eleven bytes, UTF-8 encoded, sitting in your browser's memory, then on a web server, then eventually in the host RAM of a machine somewhere in a data centre next to a rack of GPUs.
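The payload really is that small. Encoding the prompt yourself shows everything that travels:

```python
# The prompt as the server sees it: UTF-8 bytes, no tensors anywhere yet.
prompt = "Hello world"
raw = prompt.encode("utf-8")

print(len(raw))    # 11 bytes
print(list(raw))   # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```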
The model lives on the GPU, not on the host. Every forward pass begins with the same dull administrative step: move the bytes over PCIe from the host into HBM, the high-bandwidth memory stacked in the same package as the compute die. This transfer crosses the lowest-bandwidth link inside the machine, and it is the hop most people never hear about.
Tokenisation
Before the bytes can become tensors, they pass through a tokeniser. Byte-pair encoding looks at the text through a pre-learned vocabulary, typically around 128,000 to 200,000 entries, and chops the string into the fewest whole-token pieces it can. For most modern frontier models, Hello world resolves to two tokens: one for Hello, one for world with the leading space included.
Every token in the vocabulary has an integer ID. So the whole input collapses to a tiny list of integers. For us, something like [9906, 1917]. That is what gets loaded onto the GPU. The bytes are gone. From here on, everything is math on those numbers.
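A sketch of the same step using the open cl100k_base vocabulary, one of many; different models ship different tokenisers, so the exact IDs vary:

```python
# Tokenisation sketch. IDs are vocabulary-specific; cl100k_base is one example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello world")

print(ids)                              # [9906, 1917] under this vocabulary
print([enc.decode([i]) for i in ids])   # ['Hello', ' world']
```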
The embedding lookup
A token ID on its own is meaningless. The model's first job is to translate each integer into a high-dimensional vector that carries some learned notion of what that token means. This is the embedding table: a giant matrix of dimensions vocabulary by model-width, typically around 150,000 rows and 4,096 columns. Around 600 million floats, sitting in HBM the whole time the model runs.
The operation itself is a gather. Token ID 9906 means "take row 9906". No multiplication, no accumulation, nothing clever. Just read 4,096 floats out of HBM and place them in the model's working buffer. Zero arithmetic. Pure memory.
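In framework terms the whole step is a single index. A sketch with illustrative shapes; the real table lives in HBM for as long as the model is loaded:

```python
import torch

vocab_size, d_model = 150_000, 4_096
# Stand-in for the learned table: about 2.4 GB in float32 here,
# stored as FP16 or FP8 in a real deployment.
embedding = torch.randn(vocab_size, d_model)

token_ids = torch.tensor([9906, 1917])
x = embedding[token_ids]    # a gather: read two rows, no arithmetic at all

print(x.shape)              # torch.Size([2, 4096])
```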
This sounds trivial, and at two tokens it is. But it introduces a pattern that will recur across the entire forward pass: some operations are gated by how fast the chip can compute, and others are gated by how fast it can read memory. Embedding lookup is the purest version of the second kind.
Position, rotated in
Transformers process all tokens in parallel. That parallelism is a feature, not a side effect, and it is the reason attention is expressive in the first place. But it creates a problem: if every token is processed at the same time, how does the model know that Hello came before world and not the other way around?
The answer used to be additive positional encodings. Modern models use rotary position embedding, usually called RoPE, which takes each pair of dimensions in the embedding vector and rotates that pair by an angle proportional to the token's position. Early tokens get small rotations, later tokens get larger ones, and the geometry of attention naturally ends up comparing tokens based on their relative positions.
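A minimal sketch of the rotation, pairing adjacent dimensions as in the original RoPE formulation; real kernels fuse this into the attention kernel and some implementations pair dimensions differently:

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate adjacent dimension pairs of x by position-dependent angles.
    x: (seq_len, dim) with dim even; positions: (seq_len,) integer positions."""
    dim = x.shape[-1]
    # One frequency per pair: fast rotations for early pairs, slow for late ones.
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)
    angles = positions[:, None].float() * freqs[None, :]      # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = torch.randn(2, 128)                          # queries for "Hello" and " world"
q_rot = rope(q, positions=torch.tensor([0, 1]))  # position 0 is left unrotated
```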
On the silicon side, this is cheap. A handful of multiplies and adds per dimension, fused into the attention kernel. It shows up in profilers as a thin sliver, rarely the bottleneck in anything.
Attention, the math
Attention is the one step that made this whole architecture worth building. Each token's embedding gets projected three ways, into a query vector Q, a key vector K, and a value vector V. Three learned weight matrices, three matmuls. The outputs live on the same chip, typically split across dozens of heads so the model can learn different kinds of relationships in parallel.
The heart of the operation is a single expression: softmax of Q times K transposed, divided by the square root of the head dimension, then multiplied by V. In words, every token asks every other token "how relevant are you to me", gets a distribution back, and uses that distribution to pull a weighted average of everyone's values. The output is a new embedding per token, now coloured by context.
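The same expression as a sketch over a single head, leaving out the causal mask that stops tokens from attending to positions after them:

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # three learned projections
    scores = q @ k.T / q.shape[-1] ** 0.5      # "how relevant are you to me"
    weights = F.softmax(scores, dim=-1)        # one distribution per token
    return weights @ v                         # context-weighted average of values
```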
That is the math. The math is not the bottleneck.
Attention, the silicon
On the chip, attention is a stack of matmuls: the Q, K, and V projections, the Q-by-K-transposed product, the attention-weighted V sum, and a final output projection. Every one is the same shape of problem: dot products between rows and columns, thousands of them, all independent, all multiply-adds.
This is exactly what tensor cores were built for. A tensor core is, in essence, a small systolic array: a grid of multiply-accumulate units wired so that activations flow in from the left, weights flow in from the top, and partial sums accumulate down and out the right edge. Data moves one cell per clock cycle. By the time the last input has entered the array, the first output has already left. The grid is never empty; it is always doing useful work.
The pattern matters. A naive matmul on a generic processor reads each weight many times, each read another round-trip to memory. A systolic array reads each weight once, then reuses it as it streams through the grid. That reuse is the entire reason tensor cores beat CUDA cores by roughly eight-to-one on FP16 matmul. Not faster math, fewer trips to memory.
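A toy way to see the reuse argument is to count how many times the weight matrix must be fetched, assuming it is too large to stay resident in on-chip memory between passes:

```python
def weight_fetches(n_rows: int, rows_per_pass: int) -> int:
    """Full reads of the weight matrix when activations stream through it
    rows_per_pass rows at a time."""
    return -(-n_rows // rows_per_pass)          # ceil(n_rows / rows_per_pass)

n_rows = 2_048                                       # activation rows in one matmul
print(weight_fetches(n_rows, rows_per_pass=1))       # one row at a time: 2048 reads of every weight
print(weight_fetches(n_rows, rows_per_pass=n_rows))  # stream the block through once: 1 read
```

The arithmetic is identical in both cases. Only the memory traffic changes.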
The KV cache, and why it gets big
During attention, the model computes K and V for every token in the sequence. On the very first forward pass, there are only two tokens, so K and V are tiny. But as generation proceeds, every new token needs to attend back to every previous one, which means every previous K and V must be available.
Rather than recomputing them every step, modern inference engines cache them in HBM. This is the KV cache. Its size scales with sequence length, number of layers, number of KV heads, and head dimension. For a 70-billion-parameter model with full multi-head attention at 8,000 tokens of context, the cache can easily exceed 20 gigabytes per sequence. At 100,000 tokens it becomes the dominant consumer of HBM, often larger than the model weights themselves.
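The sizing is back-of-envelope arithmetic. A sketch with illustrative 70B-class shapes, 80 layers and 128-dimensional heads in an FP16 cache; the grouped-query variant that appears later in this chapter shares K and V across heads, which is where its savings come from:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elt=2):
    """Per-sequence cache: a K and a V vector per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

full_mha = kv_cache_bytes(8_000, n_layers=80, n_kv_heads=64, head_dim=128)
grouped  = kv_cache_bytes(8_000, n_layers=80, n_kv_heads=8,  head_dim=128)

print(full_mha / 1e9)   # ~21 GB per sequence
print(grouped / 1e9)    # ~2.6 GB with 8 shared KV heads
```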
This is where the memory hierarchy starts to bite. SRAM, the fastest tier of on-chip memory, is measured in tens of megabytes. HBM is measured in tens to hundreds of gigabytes. Anything that does not fit in SRAM must be streamed from HBM, which costs bandwidth. A long-context decode is essentially the model saying "read the whole cache, compute a skinny matmul, write one new token, repeat".
The feed-forward network
After attention comes the largest chunk of compute in a transformer block, and the one most people never think about. A two-layer MLP applied independently to every token. The first layer expands the embedding from the model's hidden dimension, typically 4,096, up by a factor of roughly four, to 16,384. A non-linearity, usually a gated variant like SwiGLU. Then a second layer projects back down to 4,096.
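A sketch of the gated form at those dimensions. Worth noticing: the gate adds a third weight matrix alongside the up and down projections:

```python
import torch
import torch.nn.functional as F

class SwiGLUMLP(torch.nn.Module):
    """Gated feed-forward block, applied to every token independently."""
    def __init__(self, d_model: int = 4_096, d_ff: int = 16_384):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, d_ff, bias=False)   # expand
        self.up   = torch.nn.Linear(d_model, d_ff, bias=False)   # expand
        self.down = torch.nn.Linear(d_ff, d_model, bias=False)   # project back down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```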
Two matmuls per token, three in the gated form, all of them enormous. Most of the parameters in a transformer live here. In a 70-billion-parameter model, the MLP weights account for roughly two-thirds of the total. Whatever the embedding brings in, the feed-forward network reshapes, and it does so token by token with no cross-token communication.
Compute-wise, this is heaven for tensor cores. The matrices are large and square-ish. The arithmetic intensity, meaning the ratio of flops per byte of memory traffic, is high. Utilisation can approach 70 percent of peak on well-tuned kernels. It is the part of the forward pass where the hardware gets to do what it was built to do.
Sampling
After many transformer blocks, typically between 32 and 96 depending on the model, the last block's output gets one more projection. This time, back out to vocabulary size. A matmul of the final embedding against a vocabulary-sized output matrix, often just the embedding table transposed, producing a logit for every possible next token.
Softmax turns logits into probabilities. Then the model samples. Greedy sampling picks the top token. Temperature sampling rescales the distribution before picking. Nucleus sampling truncates to the smallest set whose cumulative probability exceeds some threshold and picks from there. The choice of sampler is a knob, not a breakthrough.
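A compact sketch of temperature plus nucleus sampling, with arbitrary values for the two knobs:

```python
import torch

def sample(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.95) -> int:
    """Temperature scaling followed by nucleus (top-p) sampling.
    logits: (vocab_size,) raw scores, one per token in the vocabulary."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = probs.sort(descending=True)
    cdf = sorted_probs.cumsum(-1)
    keep = int((cdf < top_p).sum()) + 1          # smallest prefix whose mass exceeds top_p
    kept_probs, kept_ids = sorted_probs[:keep], sorted_ids[:keep]
    choice = torch.multinomial(kept_probs / kept_probs.sum(), num_samples=1)
    return int(kept_ids[choice])
```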
The sampled token is a single integer. That integer is the model's output for this step. It is also the input to the next step.
Pre-fill and decode are not the same chip
The first forward pass processed the input tokens in parallel: only two here, but with a realistic prompt, hundreds or thousands. Every matmul is large and dense. Tensor cores stay busy. This regime is called pre-fill. It is compute-bound, which is to say the bottleneck is how fast the chip can multiply and accumulate.
Starting on the second token, the model only ever processes one new token at a time. The input to each matmul is now a row, not a block. The matmul is skinny. The tensor cores finish their tiny share of work in almost no time. But the KV cache, now containing every previous token's K and V, must be read from HBM in full for every layer at every step. Megabytes or gigabytes of memory traffic to produce one token.
This is decode. It is memory-bound. The same chip that ran at 70 percent tensor-core utilisation during pre-fill often drops below 10 percent during decode. The limit is not FLOPs; it is bandwidth.
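The asymmetry shows up in a single number: flops per byte of traffic for one layer's weight matmul. A sketch at an illustrative width of 4,096 in FP16; for scale, an H100-class chip needs on the order of 300 flops per byte of HBM traffic before compute, rather than bandwidth, becomes the limit:

```python
def flops_per_byte(n_tokens: int, d_in: int = 4_096, d_out: int = 4_096, elt_bytes: int = 2):
    """Arithmetic intensity of an (n_tokens x d_in) @ (d_in x d_out) matmul,
    counting each operand and result moved once."""
    flops = 2 * n_tokens * d_in * d_out
    traffic = elt_bytes * (n_tokens * d_in + d_in * d_out + n_tokens * d_out)
    return flops / traffic

print(flops_per_byte(2_048))   # pre-fill over a long prompt: ~1000 flops per byte
print(flops_per_byte(1))       # decode, one token at a time: ~1 flop per byte
```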
Almost every inference optimisation published in the last three years, from FlashAttention to paged attention to grouped-query attention to KV quantisation, targets this asymmetry. Not the math. The memory pattern around the math.
What this chapter is trying to change
There are excellent explainers for each piece of this. Mathematical walkthroughs of attention exist. Hardware teardowns of the H100 exist. Papers on FlashAttention and paged inference exist. What none of them do is walk a single prompt end-to-end, with the math and the silicon both visible at every step, so that the reader leaves with a unified picture.
The point of that unification is decision-grade intuition. If you understand that decode is memory-bound and pre-fill is compute-bound, you understand why long-context inference is expensive, why batching matters more for decode than pre-fill, and why the whole industry is optimising the memory pattern around attention rather than the attention math itself. Those are not separate facts. They are the same fact, seen from different angles.
This is chapter one. Subsequent chapters will go deeper on specific pieces: the geometry of attention in high dimensions, how quantisation reshapes the memory story, what mixture of experts does to the compute pattern, and what changes when you move from dense transformers to state-space models.