Foundations · 4 of 5

Codesign: the structural moat behind the million-fold speedup

When chip, compiler, network, and model are designed against each other rather than in isolation, the system compounds. NVIDIA delivered a million-fold speedup over a decade while Moore's Law alone would have delivered roughly a hundredfold.

The idea, from Stanford to NVIDIA

Codesign as a concept entered computer architecture through the RISC work John Hennessy led at Stanford and David Patterson led at Berkeley in the early 1980s. The insight: a chip designed in isolation from the compiler will leave performance on the table; a chip designed against a specific compiler can be simpler and faster at the same time.

NVIDIA extended the same idea outward: not just chip and compiler, but chip, compiler, network, rack, cluster, model, and application. Every layer of the stack is designed knowing what the others can do, and every layer evolves with the others. The mental model is: there are no clean abstractions, only joint optimisations.

Why this beats Moore's Law

Moore's Law describes a doubling of transistor density roughly every two years. At its peak, with Dennard scaling also raising clock speeds, that translated to about a 10x performance gain every 5 years, or a hundredfold over a decade. Dennard scaling, which let those denser transistors run faster without a rise in power density, broke down in the mid-2000s. Pure semiconductor scaling has not delivered Moore-rate gains in over a decade.
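The two rates quoted above compound differently, and a quick check makes the gap visible. The sketch below is plain arithmetic under the stated rates; the function and variable names are illustrative assumptions, not taken from the lecture.

```python
import math


def growth_over(years, factor, period_years):
    """Total growth after `years` if a quantity multiplies by `factor` every `period_years`."""
    return factor ** (years / period_years)


# Transistor density doubling every two years, compounded over a decade.
density_decade = growth_over(10, 2, 2)

# The quoted performance rate: 10x every five years, compounded over a decade.
perf_decade = growth_over(10, 10, 5)

# Doubling period that a 10x-per-5-years rate implies.
implied_doubling_years = 5 * math.log(2) / math.log(10)

print(f"density growth per decade:     {density_decade:.0f}x")
print(f"performance growth per decade: {perf_decade:.0f}x")
print(f"implied doubling period:       {implied_doubling_years:.2f} years")
```

Density doubling alone compounds to 32x over ten years. The hundredfold figure corresponds to performance doubling about every year and a half, which is the rate the rest of this section uses as the Moore baseline.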

By Huang's account, NVIDIA delivered roughly a million-fold speedup on AI workloads from 2015 to 2025. Transistor counts grew over the same period, but by orders of magnitude less than that. The rest came from codesign: tensor cores, mixed precision (FP16, FP8, FP4), NVLink, NVSwitch, CUDA libraries, transformer-aware kernels, rack-scale interconnects, and model architectures tuned to the hardware they run on.

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
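To see how much of the quoted speedup is left for codesign to explain, the figures can be split into a Moore-rate part and a residual. This is a back-of-the-envelope sketch using only the numbers in the text; the FP32 baseline in the precision table is an assumption added for comparison, and storage width is not a measured speedup.

```python
DECADE_YEARS = 10

total_speedup = 1e6   # million-fold figure quoted in the text
moore_speedup = 1e2   # hundredfold Moore-rate baseline quoted in the text

# Whatever Moore-rate scaling does not account for is attributed to codesign.
codesign_residual = total_speedup / moore_speedup

# Equivalent per-year multipliers over the decade.
annual_total = total_speedup ** (1 / DECADE_YEARS)
annual_moore = moore_speedup ** (1 / DECADE_YEARS)
annual_residual = codesign_residual ** (1 / DECADE_YEARS)

print(f"residual attributed to codesign: {codesign_residual:,.0f}x")
print(f"per year: total {annual_total:.2f}x, Moore {annual_moore:.2f}x, "
      f"codesign {annual_residual:.2f}x")

# Bits per value for the precision formats named in the text.
# FP32 is an assumed baseline; fewer bits means more values per byte moved.
BITS_PER_VALUE = {"FP32": 32, "FP16": 16, "FP8": 8, "FP4": 4}
values_per_fp32_slot = {name: 32 // bits for name, bits in BITS_PER_VALUE.items()}
print(values_per_fp32_slot)
```

On these figures the residual is four orders of magnitude, so narrower number formats alone (at most 8x denser than FP32) cannot account for it; the gain has to come from several layers compounding.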

The full perimeter

NVIDIA's codesign perimeter now spans: the GPU itself (architecture, tensor cores, precision formats), the CPU (Grace and Vera designed for AI workloads, not cloud workloads), HBM and memory subsystem, NVLink and NVSwitch, the rack as a single computer (NVL72), inter-rack networking (InfiniBand, Quantum-X), the software stack (CUDA, TensorRT, cuDNN), and reference model recipes (Nemotron, BioNeMo, Alpamayo, GR00T).

Each layer is a real product with real customers. Each layer is also a constraint and an opportunity for every other layer. That is the moat: any competitor optimising one layer in isolation is competing against an integrated system that has been jointly optimised for a decade.
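One way to see why a single-layer improvement can fail to show up at the system level is a roofline-style estimate, where throughput is capped by compute or by memory traffic, whichever is lower. The sketch below is a toy model with hypothetical numbers chosen for illustration; none of them describe a real NVIDIA or competitor part.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline estimate: the lower of the compute ceiling and the memory ceiling."""
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


# Hypothetical baseline: the workload is memory-bound (50 < 100).
base = attainable_tflops(peak_tflops=100, bandwidth_tb_per_s=1, flops_per_byte=50)

# Chip-only change: 4x peak compute, everything else untouched.
chip_only = attainable_tflops(peak_tflops=400, bandwidth_tb_per_s=1, flops_per_byte=50)

# Memory-only change: 2x bandwidth, everything else untouched.
memory_only = attainable_tflops(peak_tflops=100, bandwidth_tb_per_s=2, flops_per_byte=50)

# Joint change: 4x compute, 2x bandwidth, and a half-width number format,
# which is assumed here to double the useful FLOPs per byte moved.
joint = attainable_tflops(peak_tflops=400, bandwidth_tb_per_s=2, flops_per_byte=100)

for name, value in [("chip only", chip_only), ("memory only", memory_only), ("joint", joint)]:
    print(f"{name:12s} speedup over baseline: {value / base:.1f}x")
```

In this toy setup the 4x faster chip delivers no gain on its own because the workload stays memory-bound, while the jointly tuned system reaches 4x. The point is qualitative: each layer's improvement is only realised when the neighbouring layers move with it.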

What this means for the rest of the field

AMD's MI series and ROCm software stack are an attempt to do codesign at smaller scope and faster cadence. Google's TPU, JAX, and XLA stack is the same idea, kept largely within Google's own infrastructure. Cerebras, Groq, and other specialised inference players push codesign in the opposite direction: less general, more optimised for a single workload shape.

The pattern across the field is the same: pure chip vendors are losing to integrated system vendors. The economic value moves to whoever can guarantee the joint behaviour of the entire stack.