Cross-vendor comparison normalised to dense FLOPS, with no structured-sparsity inflation (the conversion is sketched below the table). HBM bandwidth gates inference; scale-up domain size gates large-model training. Together they explain why an NVL72 rack and a TPU v7p pod are not interchangeable units even when per-chip throughput looks similar.
| Chip | Vendor | Launch | FP8 dense (PFLOPS) | FP4 dense (PFLOPS) | HBM (GB) | HBM BW (TB/s) | Scale-up domain (chips) | Power (W) |
|---|---|---|---|---|---|---|---|---|
| TPU 8I (Inference) | Google | 2026-Q1 | 5.5 | 10.5 | 256 | 9.0 | 12,288 | 750 |
| TPU v7p (Ironwood) | Google | 2025-Q4 | 4.6 | — | 192 | 7.4 | 9,216 | 720 |
| MI355X | AMD | 2025-Q4 | 5.0 | 10.1 | 288 | 8.0 | 8 | 1,400 |
| B200 | NVIDIA | 2024-Q4 | 4.5 | 9.0 | 192 | 8.0 | 72 | 1,000 |
| GB200 (Grace+B200) | NVIDIA | 2024-Q4 | 9.0 | 18.0 | 384 | 16.0 | 72 | 2,700 |
| TPU v6 (Trillium) | Google | 2024-Q4 | 0.92 | — | 32 | 1.6 | 256 | 350 |
| MI325X | AMD | 2024-Q4 | 2.6 | — | 256 | 6.0 | 8 | 1,000 |
| H200 | NVIDIA | 2024-Q1 | 2.0 | — | 141 | 4.8 | 8 | 700 |
| H100 | NVIDIA | 2022-Q3 | 2.0 | — | 80 | 3.4 | 8 | 700 |
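The normalisation itself is trivial but worth making explicit: NVIDIA datasheets headline FP8/FP4 throughput with 2:4 structured sparsity, which exactly doubles the dense rate. A minimal sketch of the conversion, using the published H100 sparse FP8 figure (~3.96 PFLOPS) as the example input:

```python
# Vendor datasheets often quote FP8/FP4 throughput with 2:4 structured
# sparsity, which doubles the dense rate. The table above normalises
# everything to dense, i.e. halves sparse figures.

SPARSITY_FACTOR = 2  # 2:4 structured sparsity doubles advertised FLOPS

def dense_pflops(advertised_pflops: float, is_sparse: bool) -> float:
    """Normalise an advertised throughput figure to dense PFLOPS."""
    return advertised_pflops / SPARSITY_FACTOR if is_sparse else advertised_pflops

# H100 datasheet: ~3.96 PFLOPS sparse FP8 -> ~2.0 dense, matching the row above.
print(dense_pflops(3.96, is_sparse=True))
```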
Memory bandwidth, not compute throughput, gates inference decode: each generated token streams the model's weights (and KV cache) out of HBM, so a chip with twice the FP8 PFLOPS but the same HBM bandwidth runs decode at the same speed. See Precision and Bandwidth for why.
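A minimal roofline sketch of that claim, assuming batch size 1, weights resident in HBM, and a full weight read per token; the 70B model size and 1 byte/param (FP8) are illustrative assumptions, while the bandwidth numbers come from the table above:

```python
# Bandwidth roofline for single-stream LLM decode: tokens/s is bounded
# by (HBM bytes/s) / (bytes streamed per token). FP8 PFLOPS never
# appears, so doubling compute leaves this bound unchanged.

def decode_tokens_per_sec(params_billions: float,
                          bytes_per_param: float,
                          hbm_bw_tb_s: float) -> float:
    """Upper bound on decode speed for a memory-bound model."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return (hbm_bw_tb_s * 1e12) / bytes_per_token

# Hypothetical 70B-parameter model served in FP8 (1 byte/param):
for chip, bw in [("H100", 3.4), ("B200", 8.0), ("TPU v7p", 7.4)]:
    print(f"{chip}: ~{decode_tokens_per_sec(70, 1.0, bw):.0f} tok/s")
```

On these assumptions the H100 tops out near 49 tok/s and the B200 near 114, in proportion to their 3.4 vs 8.0 TB/s, despite a 2.25x gap in FP8 compute.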
The number of chips that can be treated as one logical accelerator via NVLink, ICI, or Infinity Fabric. NVL72 = 72 GPUs as one. TPU v7p pod = 9,216 chips as one. That ceiling shapes what models you can train without crossing slow scale-out networks.
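To see why the ceiling matters, here is a rough sizing check. The memory model is the standard ~16 bytes/param rule of thumb for mixed-precision Adam training (FP16 weights and gradients plus FP32 master weights and optimizer moments, activations ignored), and the 400B model size is an illustrative assumption; HBM capacities and domain sizes come from the table:

```python
# Does a model's training state fit entirely inside one scale-up
# domain, so no gradient/weight traffic crosses the slower
# scale-out network?

BYTES_PER_PARAM = 16  # rule of thumb for mixed-precision Adam training

def fits_in_domain(params_billions: float,
                   hbm_gb_per_chip: float,
                   domain_chips: int):
    """Compare training-state footprint to the domain's pooled HBM."""
    need_gb = params_billions * 1e9 * BYTES_PER_PARAM / 1e9
    have_gb = hbm_gb_per_chip * domain_chips
    return need_gb <= have_gb, need_gb, have_gb

# Hypothetical 400B-parameter model:
for name, hbm, domain in [("H100 x8", 80, 8),
                          ("B200 NVL72", 192, 72),
                          ("TPU v7p pod", 192, 9216)]:
    ok, need, have = fits_in_domain(400, hbm, domain)
    verdict = "fits" if ok else "spills to scale-out"
    print(f"{name}: need {need:,.0f} GB, have {have:,.0f} GB -> {verdict}")
```

Under these assumptions a 400B model needs ~6.4 TB of state: it spills past an 8-GPU H100 domain but shards comfortably inside an NVL72 rack or a TPU v7p pod, which is the interchangeability gap the caption describes.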