How does memory hierarchy work in an AI chip?
Every level of the memory hierarchy trades capacity against latency: the layers closest to the compute are the fastest and the smallest. The same model passes through SRAM, HBM, DRAM, NAND flash, and HDD at different points in its life.
AI economics are set by where the bytes sit. The closer they are to the compute, the more tokens per second the chip can produce, and the more expensive that storage layer becomes per gigabyte.
A computer is three pillars
It helps to keep a simple frame in mind. A computer does three things: it stores data, it moves data, and it calculates on data. AI hardware is no different. The accelerator is the calculation pillar, the fabric and packaging are the movement pillar, and the storage pillar is a layered ladder that starts on the die and ends in a spinning disk on a different floor of the building.
Every choice in chip design is a trade across these three pillars. Pay more for storage near the compute and the chip gets faster, but also larger and more expensive. Pay less and the compute spends more of its time waiting for data.
The five layers in order
- SRAM. On the same die as the compute. Typically tens of megabytes per die, up to a few hundred megabytes on SRAM-heavy designs. Fastest, most expensive, most area-hungry per bit.
- HBM. Stacked DRAM dies sitting on the same package as the compute, each stack connected over an interface more than a thousand bits wide. Tens to a few hundred gigabytes per accelerator. Its bandwidth is the current ceiling on inference throughput.
- DRAM. Conventional DDR or LPDDR on the motherboard. Hundreds of gigabytes to a few terabytes per server. Slower than HBM, much cheaper.
- NAND flash. Solid-state drives in the rack. Tens of terabytes per server. Non-volatile, so the data survives power loss. Used for checkpoints, datasets, working state.
- HDD. Spinning platters in a separate storage pod. Petabytes per rack. Slowest, cheapest per terabyte, where cold training data and archived weights sit.
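The ladder above can be sketched as a small lookup that answers one question: what is the nearest layer to the compute that can hold a given number of bytes? This is a minimal sketch, not a sizing tool. The capacities are illustrative orders of magnitude chosen to match the ranges in the list, not vendor specifications, and the names `MemoryLayer`, `LAYERS`, and `nearest_layer_that_fits` are hypothetical.

```python
from dataclasses import dataclass

MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15


@dataclass(frozen=True)
class MemoryLayer:
    name: str
    capacity_bytes: float  # assumed order of magnitude, not a vendor spec


# Ordered from nearest the compute to farthest from it.
# Every capacity here is an illustrative assumption.
LAYERS = [
    MemoryLayer("SRAM", 50 * MB),   # per die
    MemoryLayer("HBM", 192 * GB),   # per accelerator
    MemoryLayer("DRAM", 2 * TB),    # per server
    MemoryLayer("NAND", 30 * TB),   # per server
    MemoryLayer("HDD", 1 * PB),     # per rack
]


def nearest_layer_that_fits(size_bytes: float) -> str:
    """Return the name of the closest layer whose capacity covers size_bytes."""
    for layer in LAYERS:
        if size_bytes <= layer.capacity_bytes:
            return layer.name
    raise ValueError("larger than every layer in this sketch")


# Weights of a hypothetical 70-billion-parameter model at 2 bytes per parameter.
print(nearest_layer_that_fits(70e9 * 2))  # HBM
```

Under these assumed capacities, the weights fit in HBM but are thousands of times too large for on-die SRAM, which is why they have to be streamed through SRAM in tiles rather than held there.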
The same model touches every layer
During training, weights start on HDD, get pulled into DRAM, get pushed through HBM into SRAM tiles, and are multiplied against activations that also live in SRAM. The resulting gradients and the updated weights then walk back out through the same layers. Every step has its own physics and its own cost.
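During inference, the picture is starker. In the decode phase, the weights have to be re-read from HBM for every generated token. The time that read takes is the lower bound on latency per token, which makes it the upper bound on tokens per second for a single request stream on one chip. The fewer bytes the model occupies in HBM, the cheaper the token.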
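The decode limit reduces to one division: time per token is at least the weight bytes divided by HBM bandwidth. The sketch below assumes a single request stream with no batching, and it ignores KV-cache reads and compute time, so real throughput will be lower. The function name and the example figures (a 70-billion-parameter model, 3 TB/s of HBM bandwidth) are hypothetical, chosen only to make the arithmetic concrete.

```python
def max_decode_tokens_per_second(weight_bytes: float,
                                 hbm_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when each token re-reads all weights."""
    if weight_bytes <= 0 or hbm_bandwidth_bytes_per_s <= 0:
        raise ValueError("both arguments must be positive")
    return hbm_bandwidth_bytes_per_s / weight_bytes


# Hypothetical figures, for illustration only.
params = 70e9          # parameter count
hbm_bandwidth = 3.0e12  # bytes per second

fp16 = max_decode_tokens_per_second(params * 2, hbm_bandwidth)  # 2 bytes per parameter
int8 = max_decode_tokens_per_second(params * 1, hbm_bandwidth)  # 1 byte per parameter

print(f"16-bit weights: at most {fp16:.1f} tokens/s")
print(f" 8-bit weights: at most {int8:.1f} tokens/s")
```

Halving the bytes per parameter doubles the bound, which is the sense in which a smaller model in HBM bytes makes a cheaper token.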
Why this map is the basis for the rest
The other compute explainers all touch this hierarchy somewhere. The memory wall page is about the limit that HBM bandwidth places on decode. The Cerebras and Taalas chip designs are bets about how much SRAM can replace HBM. The HBM4 vendor war is about who supplies the second layer. The NAND and HDD layers decide how much data a lab can hold ready to train.
The storage pillar is not a side concern in AI infrastructure. It is half of why the chip costs what it costs and runs at the speed it runs.
Strategic read
Treat the storage hierarchy as a budget. Every layer has a price per gigabyte and a price per byte-moved. Architectural decisions are choices about where to spend that budget.
When the bottleneck rotates, it rotates within this stack. Today it sits at HBM bandwidth. Tomorrow, if quantization and mixture-of-experts shrink the active footprint, it may rotate to SRAM area or to NAND throughput for very long context workloads.
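How much quantization and mixture-of-experts shrink the active footprint can be estimated with simple arithmetic. This is a minimal sketch under stated assumptions: one token, one request stream, and a fixed fraction of parameters touched per token. The function name and the example model (100 billion parameters, 25% active, 4-bit weights) are hypothetical. Note that sparsity reduces the bytes read per token, not the bytes that must be stored, since every expert still has to live somewhere in the stack.

```python
def active_bytes_per_token(total_params: float,
                           active_fraction: float,
                           bits_per_param: float) -> float:
    """Bytes of weights read to produce one token, under the assumptions above."""
    if not 0.0 < active_fraction <= 1.0:
        raise ValueError("active_fraction must be in (0, 1]")
    return total_params * active_fraction * bits_per_param / 8


total_params = 100e9  # hypothetical model size

# Dense model with 16-bit weights: every parameter is read for every token.
dense = active_bytes_per_token(total_params, active_fraction=1.0, bits_per_param=16)

# Mixture-of-experts with 4-bit weights and an assumed 25% of parameters active.
sparse = active_bytes_per_token(total_params, active_fraction=0.25, bits_per_param=4)

print(f"dense 16-bit: {dense / 1e9:.1f} GB read per token")
print(f"MoE 4-bit:    {sparse / 1e9:.1f} GB read per token")
print(f"reduction:    {dense / sparse:.0f}x")
```

In this example the per-token read drops by 16x, which is the kind of shift that could move the bottleneck away from HBM bandwidth and toward another layer.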