
What is HBM, and why does it set the price of a token?

High-bandwidth memory stacks DRAM dies vertically over a logic base die and connects them to the compute with a very wide bus. It is the layer that decides how fast tokens come out of an accelerator.

Where the binding constraint sits today

HBM bandwidth and capacity sit between every model and every token it produces. The HBM vendor map and the base-die strategy are now strategic surfaces, not procurement details.

Stacks of DRAM with a logic basement

An HBM stack is four, eight, twelve, or (with HBM4) sixteen DRAM dies stacked on top of each other and connected by through-silicon vias (TSVs) that run vertically through the stack. At the bottom sits a base die, a small piece of logic that talks to the compute chip on the package.

The whole assembly is then attached to the accelerator through an advanced package such as CoWoS or EMIB, sitting close enough that the data path is short and wide.

Figure: HBM package, in cross-section. An HBM stack (8+ DRAM dies over a logic base die) sits beside the GPU compute die and connects to it through a 1024-bit bus across a silicon interposer, all mounted on the package substrate.

Stack, don't spread

Eight or more DRAM dies in a tower fit a huge memory into a tiny footprint next to the GPU.

Drill straight down

Through-silicon vias are copper columns punched through every die, carrying signal and power between floors.

Bridge, millimeters wide

The interposer carries the 1024-bit bus between the stack and the GPU. A bus that wide is only practical over a distance of a few millimeters.

Bandwidth comes from width, not clock

HBM does not run faster than DDR per pin. Its win comes from how many pins it uses at once. A single HBM stack exposes more than a thousand data lines into the compute die.

That width is why HBM bandwidth is measured in terabytes per second on a single accelerator while a board-level DDR channel is measured in tens of gigabytes per second.

H100 HBM bandwidth: 3.35 TB/s

B200 HBM bandwidth: 8.0 TB/s
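The width-versus-clock argument reduces to one multiplication: peak bandwidth is bus width times per-pin data rate. The sketch below is a minimal illustration of that, not a vendor specification. The 6.4 Gb/s per-pin rate and the 64-bit DDR channel width are assumed round numbers chosen for illustration; only the 1024-bit stack width comes from the text above.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit_s: float) -> float:
    """Peak bandwidth in GB/s when every pin moves pin_rate_gbit_s gigabits per second."""
    return bus_width_bits * pin_rate_gbit_s / 8  # 8 bits per byte


# Assumed, illustrative values: identical per-pin rate, very different widths.
PIN_RATE_GBIT_S = 6.4
hbm_stack = peak_bandwidth_gb_s(1024, PIN_RATE_GBIT_S)  # one 1024-bit HBM stack
ddr_channel = peak_bandwidth_gb_s(64, PIN_RATE_GBIT_S)  # one 64-bit DDR channel (assumed width)

print(f"HBM stack:   {hbm_stack:.1f} GB/s")
print(f"DDR channel: {ddr_channel:.1f} GB/s")
print(f"Ratio:       {hbm_stack / ddr_channel:.0f}x, entirely from width")
```

With the per-pin rate held equal, the 16x gap comes only from the 1024-to-64 difference in width. An accelerator with several stacks multiplies the per-stack figure again, which is how the totals reach terabytes per second.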

The HBM4 base-die war

Before HBM4, every vendor built the base die on its own internal DRAM process. That was cheap because the vendor owned the fab, but DRAM-process transistors are poor for logic. With HBM4, the strategies split.

SK Hynix moved the base die to a TSMC 12-nanometer-class logic process. Samsung uses its in-house SF4X node, which Irrational Analysis rates as roughly between TSMC N6 and N7. Micron stuck with its internal DRAM process, which is delaying its HBM4 qualification.

The read: even if NVIDIA rejects Micron HBM4 for Rubin, Micron sells the wafers as regular DRAM at very high prices, so the financial damage is small. But the technical leadership story is now SK Hynix first, Samsung close, Micron behind.

Source: Irrational Analysis interview, Chris Barber, May 2026

Packaging eats half the supply chain

Putting HBM stacks next to a compute die is not a trivial step. It requires advanced packaging slots, mostly at TSMC CoWoS today, with Intel EMIB now ramping as an alternative.

That is why the HBM constraint and the packaging constraint show up together. A vendor that has HBM allocation but no CoWoS slot still cannot ship.

Dodging the HBM Crisis: LPDDR5X as an Alternative

The severe global HBM3e and HBM4 supply crunch, coupled with extreme bottlenecks in TSMC’s CoWoS advanced packaging capacity, has forced hardware designers to look beyond the standard packaging stack. Intel’s upcoming data center accelerator, Crescent Island, is built on the company’s Xe3P GPU architecture and makes a major tactical play: it bypasses HBM entirely in favor of board-level LPDDR5X (Low-Power Double Data Rate 5X) memory.

Instead of stacking DRAM vertically with complex through-silicon vias (TSVs) on a silicon interposer next to the GPU, Crescent Island solders 20 LPDDR5X memory packages (12 on the front, 8 on the back) flat on a standard motherboard PCB surrounding the socket. This achieves a massive 160GB capacity without using a single CoWoS packaging slot or HBM wafer allocation, letting Intel manufacture and ship accelerators at scale. The 1280-bit LPDDR5X-9600 interface delivers around 1.5 TB/s of aggregate bandwidth.
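The quoted figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers given above (20 packages, 160GB, a 1280-bit bus, LPDDR5X-9600). The per-package split assumes capacity and bus width are divided evenly across packages, which the disclosure as summarized here does not state.

```python
PACKAGES = 20              # LPDDR5X packages on the board
TOTAL_CAPACITY_GB = 160    # total memory capacity
BUS_WIDTH_BITS = 1280      # aggregate interface width
TRANSFER_RATE_MT_S = 9600  # LPDDR5X-9600: mega-transfers per second per pin

# bits/s across the bus -> bytes/s -> GB/s (decimal units)
bandwidth_gb_s = BUS_WIDTH_BITS * TRANSFER_RATE_MT_S / 8 / 1000

# Assumption: capacity and bus width are split evenly across packages.
capacity_per_package_gb = TOTAL_CAPACITY_GB / PACKAGES
bits_per_package = BUS_WIDTH_BITS // PACKAGES

print(f"Aggregate bandwidth: {bandwidth_gb_s / 1000:.3f} TB/s")
print(f"Per package: {capacity_per_package_gb:.0f} GB on a {bits_per_package}-bit interface")
```

The result, 1.536 TB/s, matches the "around 1.5 TB/s" figure, and the even split works out to 8 GB and 64 bits per package.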

Source: Tom’s Hardware, VideoCardz coverage of Intel Crescent Island disclosure, Oct 2025–Apr 2026

The Strategic Play: Scale-Up vs. Scale-Out

To understand this pivot, one must examine the first-principles trade-off of AI system architecture: Scale-Up vs. Scale-Out.

Scale-Up (Vertical) binds multiple chips tightly together using ultra-high-speed custom interconnects (such as NVIDIA's NVLink, as used in the NVL72 rack) so they act as a single massive logical GPU. Each chip holds a shard of a large LLM and must stream that shard from memory for every generated token, so modern tensor cores need immense HBM bandwidth (3.3 to 8.0 TB/s) to avoid being starved of data during token generation. If bandwidth drops, expensive compute cores sit idle, destroying efficiency.

Scale-Out (Horizontal) connects independent servers over standard networks. In high-concurrency cloud inference, the goal is to run many independent model copies or handle thousands of concurrent user queries. With 160GB of LPDDR5X memory per socket, a large model can fit entirely on a single inexpensive board. Each token is generated more slowly because of the lower memory bandwidth (~1.5 TB/s vs. 3.3–8 TB/s on HBM), but the systems are so cheap to build that you can deploy 4x to 5x more nodes for the same budget. The aggregate throughput (tokens per second per dollar) can be significantly higher, making it a highly profitable configuration for mass cloud inference.
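Whether scale-out wins on throughput per dollar depends on how the node-count multiple compares with the bandwidth gap. The sketch below is a deliberately crude model: it treats decode throughput as proportional to memory bandwidth and ignores batching, compute limits, KV-cache traffic, and networking. It uses only the bandwidth range and the 4x to 5x node multiple quoted above; the function names are hypothetical.

```python
def breakeven_node_ratio(hbm_tb_s: float, lpddr_tb_s: float) -> float:
    """How many LPDDR nodes it takes to match one HBM node's memory bandwidth."""
    return hbm_tb_s / lpddr_tb_s


def aggregate_bandwidth_tb_s(nodes: int, per_node_tb_s: float) -> float:
    """Total memory bandwidth across independent scale-out nodes."""
    return nodes * per_node_tb_s


LPDDR_TB_S = 1.5  # approximate LPDDR5X figure quoted above

for hbm_tb_s in (3.3, 8.0):  # ends of the HBM range quoted above
    needed = breakeven_node_ratio(hbm_tb_s, LPDDR_TB_S)
    print(f"vs {hbm_tb_s} TB/s HBM node: break-even at {needed:.1f} LPDDR nodes")
    for nodes in (4, 5):  # node multiple quoted above, for the same budget
        total = aggregate_bandwidth_tb_s(nodes, LPDDR_TB_S)
        verdict = "ahead" if total > hbm_tb_s else "behind"
        print(f"  {nodes} LPDDR nodes -> {total:.1f} TB/s aggregate ({verdict})")
```

Under these assumptions, four or five LPDDR nodes comfortably out-aggregate a 3.3 TB/s HBM part but fall just short of an 8.0 TB/s one, whose break-even is about 5.3 nodes. The per-dollar case therefore hinges on the real cost ratio and on which HBM generation is the comparison point.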

SRAM (Static RAM): On-Chip (On-Die). Sits directly inside the logic cores. Tiny capacity (tens to hundreds of megabytes) because each bit needs a six-transistor (6T) cell, but ultra-high bandwidth and zero packaging complexity.
HBM (High Bandwidth Memory): Co-Packaged (On-Package). DRAM dies stacked vertically with TSVs and placed next to the compute die on a silicon interposer. High bandwidth (3.3–8.0 TB/s) but extremely expensive and yield-constrained.
LPDDR5X (Low-Power DDR5X): Off-Package (On-Board). Laid out flat on a standard PCB surrounding the processor. Moderate bandwidth (~1.5 TB/s at LPDDR5X-9600 on a 1280-bit bus) with large capacity (160GB+), cheap, and independent of advanced packaging bottlenecks.

The VR200 NVL72 Bill of Materials: Where the Money Goes

To understand who wins the AI capex race, one must look at the physical bill of materials (BOM) of a complete state-of-the-art server rack. According to Morgan Stanley estimates from 2026, a single NVIDIA Vera Rubin (VR200) NVL72 rack will cost approximately $7.8 million—nearly double the $3.5–4.0 million cost of the Blackwell (GB300) generation.

The breakdown reveals a large shift in where hyperscaler spending goes. While NVIDIA’s GPUs remain the largest cost in absolute terms, memory (HBM) is the fastest-growing slice of the wallet. In the VR200 generation, HBM costs surge from $373,939 to $2,001,600, a 435% increase. Memory now captures more than a quarter of the entire rack’s cost, taking share away from compute.

This shift changes the investment thesis for AI infrastructure. As hyperscalers scale up their spend, the companies that control memory, high-layer PCBs, and power management components are capturing an ever-larger share of the economic value flow.

HBM Memory (+435%): $373,939 → $2,001,600. Vera Rubin’s shift to 12-high and 16-high HBM4 stacks makes memory about 25.7% of the rack’s BOM. Beneficiaries: SK Hynix, Samsung, Micron.
GPU Compute (+57%): $2,520,000 → $3,960,000. The GPUs remain the largest single ticket item, but their share of the rack falls from about 63% to about 51% as memory and networking capture more of the spend. Beneficiary: NVIDIA.
Networking & Switch Chips (+121%): $325,800 → $720,000. Driven by optical connections and NVLink Switch chips to bind the rack together. Beneficiaries: Broadcom, NVIDIA, Marvell.
High-Density PCBs (+233%): $35,100 → $116,730. The transition to ultra-low-loss copper clad laminates (CCL) and high-layer count boards. Beneficiaries: Ibiden, Unimicron, Kingboard.
ABF Substrate & MLCCs (+94%): $12,690 → $24,660. Decoupling and packaging for a rack drawing over 100 kilowatts. Beneficiaries: Murata, Taiyo Yuden, Ibiden, Shinko.
Power Delivery & Cooling (+21%): $122,210 → $148,080. High-efficiency power modules and liquid cooling systems needed to step down high voltages at the rack. Beneficiaries: Vertiv, Delta Electronics, Lite-On.
Total Rack Cost (+95%): $3,994,551 → $7,803,148. A near-doubling in cost that restricts state-of-the-art scale-up clusters to elite hyperscalers and sovereign programs.
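The growth rates and rack shares follow directly from the dollar figures in the list above, and they are easy to recompute. The sketch below copies those figures as given. The helper names are hypothetical, and the "unlisted" line is simply the total minus the six itemized categories.

```python
# (GB300 cost, VR200 cost) in US dollars, copied from the list above.
BOM = {
    "HBM memory": (373_939, 2_001_600),
    "GPU compute": (2_520_000, 3_960_000),
    "Networking & switch chips": (325_800, 720_000),
    "High-density PCBs": (35_100, 116_730),
    "ABF substrate & MLCCs": (12_690, 24_660),
    "Power delivery & cooling": (122_210, 148_080),
}
TOTAL = (3_994_551, 7_803_148)


def pct_increase(old: float, new: float) -> float:
    """Percentage growth from old to new."""
    return (new - old) / old * 100


def share_of_total(value: float, total: float) -> float:
    """Value as a percentage of total."""
    return value / total * 100


for name, (old, new) in BOM.items():
    print(f"{name:28s} {pct_increase(old, new):+6.1f}%  "
          f"share {share_of_total(old, TOTAL[0]):5.1f}% -> {share_of_total(new, TOTAL[1]):5.1f}%")

# Whatever the six itemized categories do not account for.
unlisted_old = TOTAL[0] - sum(old for old, _ in BOM.values())
unlisted_new = TOTAL[1] - sum(new for _, new in BOM.values())
print(f"{'Unlisted remainder':28s} ${unlisted_old:,} -> ${unlisted_new:,}")
print(f"{'Total rack':28s} {pct_increase(*TOTAL):+6.1f}%")
```

Printing the share columns side by side makes the shift visible in one place: memory's share of the rack rises sharply while compute's share falls, even though both grow in absolute dollars.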

Strategic read

HBM is the inference bottleneck in plain English. Every generated token requires the model's weights to be read from memory once. HBM bandwidth therefore sets the ceiling on tokens per second per chip for a single stream, or equivalently the floor on time per token.
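The read-once-per-token argument turns into a rough bound on single-stream decode speed: memory bandwidth divided by the bytes of weights read per token. The sketch below assumes a hypothetical dense 70-billion-parameter model stored at one byte per parameter and a batch size of one, and it ignores KV-cache reads and any compute limits. The bandwidth figures are the ones quoted in this article.

```python
def max_decode_tokens_per_s(bandwidth_tb_s: float,
                            params_billions: float,
                            bytes_per_param: float) -> float:
    """Upper bound on tokens/s when each token requires one full read of the weights."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


# Hypothetical model: 70B dense parameters at 1 byte each (for example, 8-bit weights).
PARAMS_BILLIONS = 70
BYTES_PER_PARAM = 1

for label, tb_s in [("H100 HBM", 3.35), ("B200 HBM", 8.0), ("LPDDR5X board", 1.5)]:
    bound = max_decode_tokens_per_s(tb_s, PARAMS_BILLIONS, BYTES_PER_PARAM)
    print(f"{label:14s} {tb_s:>5.2f} TB/s -> at most {bound:6.1f} tokens/s per stream")
```

Batching raises total throughput because many requests share each pass over the weights, but the per-stream bound does not move. That is why the bandwidth number, not peak FLOPS, is the first figure to check for inference.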

For Pere, the lens is this: HBM capacity sets training scale, HBM bandwidth sets inference economics, and the HBM4 base-die strategy quietly decides which vendor leads the next two cycles.