The physics floor
A transistor is a switch. Apply voltage to one terminal and a controlled current flows between the other two. Doped silicon — silicon with a trace of phosphorus or boron added — gives you the electrical asymmetry that makes the switch work. That is the entire foundation.
Three numbers govern what a transistor can do for you:
- Count. How many switches fit on a chip. A modern frontier accelerator carries 100–200 billion transistors.
- Switching speed. How fast each switch can flip. The clock has been stuck near 2–3 GHz for two decades because going faster melts the chip.
- Leakage. How much current flows through a switch even when it is “off.” Below ~5 nm process nodes, leakage starts to dominate the power budget.
Moore's law was the observation that count doubled every ~2 years. It held from 1965 to roughly 2015. It is over now — or, more precisely, it is alive only in the sense that we have replaced “more transistors per square millimeter” with “more square millimeters per package, stacked vertically, wired together cleverly.” The economics of the new vectors are different. We will return to that.
Why GPUs, not CPUs
A CPU is built to run one instruction stream as fast as possible. It spends most of its silicon on prediction, caches, and branch handling — the machinery for guessing what comes next when the program forks. A modern CPU has 8–128 cores. Each core is sophisticated. Each core sits idle most of the time, waiting for memory.
The transformer architecture, which dominates frontier AI, almost never branches. Every layer is the same shape: a giant matrix multiplication followed by an elementwise nonlinearity. There is no “if” to predict. There are billions of identical multiply-and-add operations to perform in lockstep.
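The shape is easy to see in code. Here is a minimal NumPy sketch of the MLP half of a transformer layer, with toy dimensions chosen purely for illustration:

```python
import numpy as np

# Toy dimensions, small enough to run instantly; no real model uses these.
batch, seq, d_model, d_ff = 4, 512, 1024, 4096

x = np.random.randn(batch * seq, d_model).astype(np.float32)
w_up = np.random.randn(d_model, d_ff).astype(np.float32)
w_down = np.random.randn(d_ff, d_model).astype(np.float32)

# matmul -> elementwise nonlinearity -> matmul. Note what is absent:
# there is no data-dependent branch anywhere.
h = x @ w_up                # one giant matrix multiplication
h = np.maximum(h, 0.0)      # elementwise nonlinearity (plain ReLU for simplicity)
y = h @ w_down              # another giant matrix multiplication
print(y.shape)              # (2048, 1024): every token ran identical arithmetic
```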
A GPU is built for that workload. Tens of thousands of small cores, all executing the same instruction across different data. Almost no branch prediction. Almost no cache. The entire silicon budget goes into throughput: floating-point multipliers and the wires to feed them.
The same architectural choice applies to TPUs (Google), Trainium and Inferentia (Amazon), MI-series (AMD), and the lab-internal designs at Anthropic and others. They differ in the shape of their throughput — some have larger systolic arrays, some have many smaller tensor cores, some bind tighter to a specific numeric format — but the underlying argument is the same: AI is uniform parallelism, and silicon laid out for uniform parallelism beats silicon laid out for branchy serial work by 10× or more on the same transistor budget.
The three numbers that gate everything
For an AI accelerator, three numbers determine its place in the stack. Skip these and the chip race is incomprehensible.
1 — Compute throughput (PFLOPS)
Peak floating-point operations per second, almost always quoted at the lowest numeric format the chip supports natively. Today: 5–20 PFLOPS per chip at FP8, twice that at FP4. Vendors love this number because it is the biggest. It gates training throughput. It does not gate inference decode.
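What "gates training throughput" means is easiest to show with the common ~6 × parameters × tokens approximation for dense-transformer training FLOPs. Every specific number in the sketch below is an assumption, not a measurement:

```python
# Rough training-time estimate. The 6 * params * tokens rule of thumb, the
# utilization figure, and the cluster size are all illustrative assumptions.
params = 1e12              # hypothetical 1-trillion-parameter model
tokens = 15e12             # hypothetical 15-trillion-token run
peak_flops = 10e15         # 10 PFLOPS per chip at FP8, mid-range of the figure above
utilization = 0.4          # assumed fraction of peak sustained in practice
chips = 10_000             # assumed cluster size

total_flops = 6 * params * tokens
seconds = total_flops / (peak_flops * utilization * chips)
print(f"~{seconds / 86_400:.0f} days of training")   # about 26 days with these numbers
```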
2 — HBM bandwidth (TB/s)
High-bandwidth memory: a stack of DRAM dies sitting next to the compute die, connected by thousands of short, fat wires. Bandwidth is how fast the compute can read its weights. For inference, this is the binding constraint — not FLOPS. A chip with twice the FP8 throughput but the same HBM bandwidth runs decode at the same speed. This is the most common misreading of a chip spec sheet.
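The arithmetic behind that claim fits in a few lines. The model size and bandwidth below are assumptions for illustration; the point is that FLOPS never appears in the expression:

```python
# Decode is bandwidth-bound: each generated token must stream every active
# weight from HBM at least once, so the ceiling is bandwidth / weight bytes.
# Both figures below are illustrative assumptions, not product specs.
params = 70e9               # hypothetical 70B-parameter dense model
bytes_per_param = 1         # weights stored in FP8
weight_bytes = params * bytes_per_param

hbm_bandwidth = 8e12        # assumed 8 TB/s of HBM bandwidth

tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"~{tokens_per_sec:.0f} tokens/sec per request stream")
# Doubling FLOPS leaves this unchanged; doubling HBM bandwidth doubles it.
```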
3 — Scale-up domain (chip count)
The number of chips that can be treated as one logical accelerator via a tight, low-latency interconnect — NVLink, ICI, or Infinity Fabric. NVL72 means 72 GPUs presenting as one chip to the software layer. TPU v7p means 9,216 chips presenting as one. This number gates the size of the largest model you can train without crossing slower scale-out networks. It is the most under-discussed number in AI hardware.
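A back-of-envelope sketch of why the domain caps model size, assuming an NVL72-sized domain, 192 GB of HBM per chip, and the common ~16 bytes of training state per parameter for mixed-precision Adam (all assumptions, with activation memory ignored):

```python
# Largest model whose full training state fits inside one scale-up domain.
# Chip count, HBM per chip, and bytes per parameter are illustrative assumptions.
chips_in_domain = 72             # an NVL72-sized domain
hbm_per_chip = 192e9             # assumed 192 GB of HBM per chip
bytes_per_param_training = 16    # weights + gradients + Adam moments, mixed precision

domain_hbm = chips_in_domain * hbm_per_chip
max_params = domain_hbm / bytes_per_param_training
print(f"~{max_params / 1e9:.0f}B parameters of training state per domain")
# Anything larger has to shard across domains and cross the slower scale-out fabric.
```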
See the live chip comparator for current values across vendors.
The frontier, today
As of mid-2026, four vendors are at the frontier. The gap between the leader and the bottom of the frontier is roughly one generation, ~18 months.
- NVIDIA. The default. Blackwell B200/B300 in deployment, Rubin shipping. NVL72 (72 GPUs as one) and the next-generation NVL576 are the scale-up plays. CUDA is the moat that survives every generation.
- Google. TPU v6 (Trillium) at scale, v7p in deployment for internal Gemini training. Largest scale-up domain in the industry by a wide margin. The 3D torus topology is the key architectural choice (more on that below).
- AMD. MI300X and MI350 in production, MI400 announced. The chip is competitive on raw FLOPS and HBM. The software (ROCm) is improving but still trails CUDA for most workloads.
- Lab-internal. Anthropic, OpenAI, Meta, and Amazon all have internal accelerator programs. None are publicly comparable on per-chip throughput, but they matter strategically: a lab that owns its own accelerator owns its own cost curve.
Rate of improvement
With raw transistor scaling slowing, the per-generation gain — still ~2× in real-world AI throughput every 18 months — comes from four compounding vectors:
- Numeric format. FP16 (2017) → FP8 (2022) → FP4 (2024) → FP2 (on roadmaps). Each halving doubles throughput per transistor. The trick is keeping training numerically stable at lower precision, which has turned into a major research line.
- Packaging. Stacking compute dies onto interposers (CoWoS, the key TSMC technology) lets you put more silicon and more HBM into a single package. CoWoS supply is currently the bottleneck on global GPU production — not transistor manufacturing.
- HBM generation. HBM3 → HBM3e → HBM4. Each generation roughly doubles per-stack bandwidth. The three vendors capable of producing leading-edge HBM (SK Hynix, Samsung, Micron) are themselves a supply-chain bottleneck.
- Scale-up domain. Going from NVL8 to NVL72 to NVL576 multiplies the largest model you can train coherently. This is the most expensive vector to advance — it requires new networking, new cooling, new rack architectures — but it has the longest legs.
Compounded, these get you the headline 2× per generation. Strip any one of them out and the curve collapses.
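As a toy illustration of the compounding, take four modest per-vector gains (pure assumptions, not measured figures) and multiply them:

```python
# None of the four vectors alone delivers 2x per generation; multiplied
# together they can. Every individual gain below is an illustrative assumption.
format_gain    = 1.3   # part of the workload moved to a lower-precision format
packaging_gain = 1.2   # more compute silicon and HBM stacks per package
hbm_gain       = 1.2   # next HBM generation feeding the compute faster
scaleup_gain   = 1.1   # larger coherent domain, less traffic over slow links

total = format_gain * packaging_gain * hbm_gain * scaleup_gain
print(f"compounded gain per generation: ~{total:.2f}x")   # ~2.06x
# Drop any single factor to 1.0 and the product falls well short of 2x.
```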
Two architectural philosophies
GPU vs ASIC
A GPU is general-purpose: programmable enough that the same chip runs the next architecture you haven't invented yet. An ASIC (application-specific integrated circuit) is locked to a smaller workload but extracts more performance per watt and per dollar from that workload. TPUs are the cleanest example of an “AI ASIC”: ruthlessly specialized for matrix math at low precision, almost nothing else.
The trade-off is real. ASICs win on cost-per-inference for stable workloads. GPUs win when the workload changes — which it has, every 18 months, for a decade. Whether the trade-off shifts depends on whether transformer-shaped workloads stay dominant.
Scale-up vs scale-out
Two ways to put many chips together.
- Scale-up: a tight, low-latency, high-bandwidth fabric that makes N chips look like one chip. Bandwidth in TB/s per link. Distance in centimeters. NVLink, ICI, Infinity Fabric. Limited by physics — you can only run wires so far before signal integrity collapses.
- Scale-out: standard network fabric (InfiniBand, Ethernet) connecting scale-up domains. Bandwidth in 100s of GB/s. Distance in meters. Slower by an order of magnitude.
The scale-up boundary is where the architectural choice gets interesting. Anything that fits inside the scale-up domain runs as if on one chip. Anything that crosses the boundary has to deal with a scale-out fabric that is 10× slower.
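A rough sketch of what that cliff costs, using the standard ring all-reduce estimate of roughly 2 × bytes / per-chip link bandwidth; the shard size and both bandwidth figures are assumptions:

```python
# Time to all-reduce one gradient shard inside vs across a scale-up domain.
# Uses the ring all-reduce approximation (~2 * bytes / link bandwidth); the
# shard size and both bandwidth figures are illustrative assumptions.
grad_bytes = 4e9             # hypothetical 2B-parameter shard in FP16

scale_up_bw  = 900e9         # assumed ~900 GB/s per chip over the scale-up fabric
scale_out_bw = 100e9         # assumed ~100 GB/s per chip over the scale-out network

for name, bw in [("scale-up", scale_up_bw), ("scale-out", scale_out_bw)]:
    t_ms = 2 * grad_bytes / bw * 1e3
    print(f"{name:>9}: ~{t_ms:.0f} ms per all-reduce")
# Roughly 9 ms vs 80 ms: the same collective, an order of magnitude apart.
```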
3D torus vs all-to-all
Inside a scale-up domain, two topologies dominate.
- 3D torus (Google's choice for TPU pods): each chip is wired to its 6 spatial neighbors. Total wiring scales linearly with chip count, which is why TPU pods scale to 9,216 chips. Cost: any communication that has to go “across” the torus traverses many hops.
- All-to-all via switches (NVIDIA's NVLink switch fabric): every chip can talk to every other chip in one hop. Lower latency. Cost: switch silicon and cabling cost grow super-linearly with chip count, which is why NVL caps at 72 today and ~576 next.
Neither is universally better. Workloads with mostly-local communication patterns (early transformer layers; the residual stream within a model parallel split) run well on a torus. Workloads with global communication (all-reduce, expert routing in MoE models) want all-to-all. Frontier labs increasingly co-design the model architecture around the topology they have, not the other way around.
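The hop-count trade-off is easy to quantify. On a wrapped (torus) axis of length n, the farthest chip is n // 2 hops away; a switched fabric is always one hop. The pod dimensions below are illustrative, not any specific product's layout:

```python
# Worst-case hop counts: 3D torus vs switched all-to-all.
def torus_worst_case_hops(dims):
    # On a wrapped axis of length n, the farthest chip along that axis is n // 2 hops away.
    return sum(d // 2 for d in dims)

for dims in [(4, 4, 4), (8, 8, 8), (16, 24, 24)]:
    chips = dims[0] * dims[1] * dims[2]
    hops = torus_worst_case_hops(dims)
    print(f"{chips:>5} chips as a {dims} torus: worst case {hops} hops (switched fabric: 1)")
```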
What this means for AI strategy
With the model in hand, the binding constraints in 2026 read clearly. None of them is transistor count.
- HBM bandwidth caps inference economics. Per-token cost is set by HBM throughput. Doubling FLOPS without doubling HBM doesn't lower the price of an API call; a rough cost sketch at the end of this section works through the arithmetic.
- Scale-up domain caps training scale. The largest model you can train without falling off the scale-up cliff is set by NVL72 / TPU pod size, not by total fleet size.
- CoWoS packaging caps supply. A frontier accelerator that can't be packaged can't ship. The CoWoS line at TSMC is the most important supply curve in AI.
- Numeric format dictates what model architectures are viable. Architectures that don't train stably at FP8/FP4 are at a structural cost disadvantage. The pressure on research is downward in precision, not just outward in scale.
- Power eats it all. A frontier training cluster wants 1–10 GW of continuous draw. The grid problem is the chip problem one layer down. See the Power section for that floor.
Read a chip launch through these constraints and you can tell within five minutes whether it will move the frontier or not. Most do not. The ones that do are the ones that move HBM, packaging, or scale-up domain — not the ones that move only PFLOPS.
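As promised above, here is a rough sketch of the first constraint's arithmetic: a per-token cost floor derived from HBM bandwidth and an assumed all-in chip cost. Model size, bandwidth, chip cost, and batch size are all illustrative assumptions:

```python
# Per-token cost floor implied by HBM bandwidth. Batching lets one pass over
# the weights serve many concurrent requests, which is why batch size appears.
weight_bytes       = 70e9    # hypothetical 70B model at 1 byte/param (FP8)
hbm_bandwidth      = 8e12    # assumed 8 TB/s
chip_cost_per_hour = 10.0    # assumed all-in $/chip-hour (capex + power + overhead)
concurrent_streams = 32      # assumed requests decoded per pass over the weights

weight_passes_per_sec = hbm_bandwidth / weight_bytes
tokens_per_hour = weight_passes_per_sec * concurrent_streams * 3600
cost_per_million = chip_cost_per_hour / tokens_per_hour * 1e6
print(f"~${cost_per_million:.2f} per million output tokens at these assumptions")
# Double the FLOPS and this number does not move. Double the HBM bandwidth and it halves.
```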