What is SRAM, and why does an AI accelerator have so little of it?

SRAM is the fastest memory on a chip and the most expensive per bit. Every accelerator design is a fight over how much of the die to spend on it.

Where the binding constraint sits today

SRAM is what keeps the math units fed, but it costs too much die area to scale freely. The architectural question is how much of the working set you can keep on-die before you have to spill into HBM.

A bit made of six transistors

Static RAM stores each bit in a small loop of six transistors. The loop holds the value as long as the chip is powered. There is no capacitor to refresh and no off-chip read to wait for.

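That is why a small SRAM array next to the compute units can be read in about a nanosecond or less. Larger on-die caches take longer, but still less than a trip to off-chip memory. The cost is area. Six transistors per bit is roughly an order of magnitude more silicon than a DRAM cell, which is one transistor and one capacitor.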
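To put the area cost in numbers, here is a back-of-envelope count of storage transistors. It is a sketch only: it counts cell transistors (six per SRAM bit, one per DRAM bit, as described above) and ignores decoders, sense amplifiers, and the DRAM capacitor. The function name and the 50 MB example capacity are illustrative assumptions.

```python
def cell_transistors(capacity_bytes: int, transistors_per_bit: int) -> int:
    """Transistors needed for the storage cells alone (no periphery)."""
    return capacity_bytes * 8 * transistors_per_bit

MB = 10**6  # decimal megabyte, chosen for round numbers

capacity = 50 * MB                    # hypothetical on-die budget
sram = cell_transistors(capacity, 6)  # 6T SRAM cell
dram = cell_transistors(capacity, 1)  # 1T1C DRAM cell (capacitor not counted)

print(f"SRAM cells: {sram / 1e9:.1f} billion transistors")  # 2.4 billion
print(f"DRAM cells: {dram / 1e9:.1f} billion transistors")  # 0.4 billion
```

The sixfold transistor count is only part of the area gap; how compactly each cell can be laid out accounts for the rest of the rough order of magnitude.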

Where SRAM lives on an accelerator

On a modern GPU or TPU, SRAM appears as register files inside each compute unit, L1 and L2 caches near the tensor cores, and software-managed scratchpad memory. The total budget is usually a few tens of megabytes per die.

When a kernel runs, the goal is to keep its working tile inside SRAM. Once the tile spills out to HBM, the math lane has to wait for the memory lane, and the chip stops being a tensor factory.
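Whether a tile "fits" is simple arithmetic. The sketch below sizes one step of a tiled matrix multiply and checks it against a scratchpad budget. The tile shapes, the 16-bit inputs with 32-bit accumulators, and the 192 KiB budget are all hypothetical values picked for illustration, not figures for any particular chip.

```python
def matmul_tile_bytes(tm: int, tn: int, tk: int,
                      in_bytes: int = 2, acc_bytes: int = 4) -> int:
    """Bytes held on-chip for one step of C[tm,tn] += A[tm,tk] @ B[tk,tn]."""
    a_tile = tm * tk * in_bytes   # input tile of A
    b_tile = tk * tn * in_bytes   # input tile of B
    c_tile = tm * tn * acc_bytes  # accumulator tile of C
    return a_tile + b_tile + c_tile

def fits(tile_bytes: int, sram_bytes: int) -> bool:
    """True if the tile stays resident; False means it spills."""
    return tile_bytes <= sram_bytes

SRAM_BUDGET = 192 * 1024  # hypothetical per-compute-unit scratchpad, in bytes

small = matmul_tile_bytes(128, 128, 64)    # 98,304 bytes
large = matmul_tile_bytes(256, 256, 128)   # 393,216 bytes

print(small, fits(small, SRAM_BUDGET))  # 98304 True
print(large, fits(large, SRAM_BUDGET))  # 393216 False
```

Doubling every tile dimension quadruples the footprint, which is why tile sizes are tuned per chip rather than chosen once.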

The Cerebras and Taalas bets

A standard accelerator design accepts the HBM round-trip and spends its area on more compute. Cerebras took the opposite bet with wafer-scale: keep the entire model inside on-die SRAM and skip HBM altogether. That requires a wafer-sized die and a lot of SRAM area.
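A rough sizing exercise shows why keeping a whole model in SRAM pushes toward wafer scale. The sketch counts weight bytes only and ignores activations and the KV cache. The 7-billion-parameter model, 16-bit weights, and 50 MB of SRAM per conventional die are hypothetical inputs, not vendor figures.

```python
def weight_bytes(params: int, bytes_per_param: int) -> int:
    """Storage needed for the weights alone."""
    return params * bytes_per_param

def dies_needed(model_bytes: int, sram_per_die_bytes: int) -> int:
    """Conventional dies required to hold the model in SRAM (integer ceiling)."""
    return -(-model_bytes // sram_per_die_bytes)

MB = 10**6

model = weight_bytes(7 * 10**9, 2)         # hypothetical 7B parameters, 2 bytes each
dies = dies_needed(model, 50 * MB)         # hypothetical 50 MB of SRAM per die

print(f"{model / 1e9:.0f} GB of weights -> {dies} dies' worth of SRAM")  # 14 GB -> 280
```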

Taalas burns most of the weights directly into the upper mask layer of the chip and uses a smaller SRAM region for the fine-tunable subset. According to Irrational Analysis, roughly two-thirds is hard-coded and one-third is SRAM that supports LoRA-style updates.

Source: Irrational Analysis interview, Chris Barber, May 2026
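LoRA-style updates, in general, leave a base weight matrix W untouched and add a trainable low-rank correction, so the layer computes y = Wx + B(Ax). The sketch below shows only that arithmetic. It is not a description of Taalas's implementation: treating W as the hard-coded part and A, B as the SRAM-resident part is an illustrative assumption, and every name and value is hypothetical.

```python
def matvec(m, x):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

# Frozen base weights: an immutable tuple stands in for values fixed at manufacture.
W = ((1.0, 0.0, 2.0),
     (0.0, 1.0, 0.0),
     (3.0, 0.0, 1.0))

# Small rank-1 adapter: mutable lists stand in for the updatable memory.
A = [[0.5, 0.0, 0.5]]        # shape (rank, d_in)
B = [[1.0], [0.0], [2.0]]    # shape (d_out, rank)

def forward(x):
    base = matvec(W, x)               # uses only the frozen weights
    delta = matvec(B, matvec(A, x))   # low-rank correction
    return [b + d for b, d in zip(base, delta)]

y = forward([1.0, 2.0, 3.0])
print(y)  # [9.0, 2.0, 10.0]
```

The adapter here holds 6 numbers against 9 in W; at realistic layer sizes a low-rank adapter is a far smaller fraction, which is what makes a small writable region sufficient for this kind of update.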

Software fights for SRAM residency

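FlashAttention, fused kernels, and tile-aware schedulers all exist mostly to keep more of the workload in SRAM for longer. Paged attention works one level down, packing the KV cache tightly in HBM, but it serves the same end: wasting less of the memory hierarchy. Every kernel that turns a memory-bound step into a compute-bound step has done it by reusing data on-chip instead of re-reading it from HBM.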
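The trick that lets FlashAttention work block by block is a streaming ("online") softmax: keep a running maximum and rescale earlier partial sums whenever it changes, so no step needs the full row of scores at once. The sketch below is a simplified scalar version in plain Python, not the FlashAttention kernel itself; the function names and block size are illustrative assumptions.

```python
import math

def reference(scores, values):
    """Softmax-weighted sum that needs every score available at once."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def streaming(scores, values, block=4):
    """Same result, visiting the inputs one block at a time."""
    running_max = float("-inf")
    denom = 0.0   # running sum of exp(score - running_max)
    numer = 0.0   # running sum of exp(score - running_max) * value
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        denom = denom * rescale + sum(math.exp(s - new_max) for s in s_blk)
        numer = numer * rescale + sum(
            math.exp(s - new_max) * v for s, v in zip(s_blk, v_blk))
        running_max = new_max
    return numer / denom

scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2, -0.3]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(abs(streaming(scores, values, block=3) - reference(scores, values)) < 1e-9)
```

Only one block plus three running scalars is live at any time, so the working set is set by the block size rather than the sequence length.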

That is why kernel work and chip design are tied. Better SRAM management makes the same silicon look like a faster chip.

Strategic read

SRAM is the constraint that hides behind the FLOPS number. A chip with 10 petaFLOPS and 20 megabytes of SRAM can lose to a chip with 7 petaFLOPS and 60 megabytes on a memory-bound workload.
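A toy roofline calculation makes that comparison concrete. It assumes a tiled matrix multiply with square tiles, where a tile of side T reuses each loaded byte about T / bytes_per_element times, and it pretends all of a chip's SRAM is one pool holding three such tiles. The 3 TB/s HBM bandwidth, the 16-bit elements, and the single-pool simplification are assumptions for illustration; only the FLOPS and SRAM figures come from the text above.

```python
import math

def tile_side(sram_bytes: float, bytes_per_elem: int) -> float:
    """Largest square tile side T such that three T x T tiles (A, B, C) fit."""
    return math.sqrt(sram_bytes / (3 * bytes_per_elem))

def attainable_flops(peak_flops: float, sram_bytes: float,
                     hbm_bytes_per_s: float, bytes_per_elem: int = 2) -> float:
    """Roofline: the lower of the compute peak and intensity x bandwidth."""
    t = tile_side(sram_bytes, bytes_per_elem)
    intensity = t / bytes_per_elem  # FLOPs per byte fetched from HBM
    return min(peak_flops, intensity * hbm_bytes_per_s)

HBM_BW = 3e12  # hypothetical 3 TB/s, assumed identical for both chips

chip_a = attainable_flops(10e15, 20e6, HBM_BW)  # 10 petaFLOPS, 20 MB SRAM
chip_b = attainable_flops(7e15, 60e6, HBM_BW)   # 7 petaFLOPS, 60 MB SRAM

print(f"chip A: {chip_a / 1e15:.1f} petaFLOPS attainable")  # 2.7
print(f"chip B: {chip_b / 1e15:.1f} petaFLOPS attainable")  # 4.7
```

Under these assumptions neither chip reaches its headline number, and the one with three times the SRAM comes out ahead because its larger tiles need fewer bytes from HBM per operation.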

Architectural arguments about wafer-scale, hardcoded-weights ASICs, and on-package memory are all variations on the same question: how do we get more of the model inside the fastest layer of storage?