Foundations · 4 of 5

Codesign: the structural moat behind the million-fold speedup

When chip, compiler, network, and model are designed against each other rather than in isolation, the system compounds. NVIDIA delivered a million-fold speedup over a decade while Moore's Law alone would have delivered roughly a hundredfold.

The idea, from Stanford to NVIDIA

Codesign as a concept entered computer architecture through the RISC work John Hennessy led at Stanford and David Patterson led at Berkeley in the early 1980s. The insight: a chip designed in isolation from the compiler will leave performance on the table; a chip designed against a specific compiler can be simpler and faster at the same time.

NVIDIA extended the same idea outward. Not just chip-compiler, but chip-compiler-network-rack-cluster-model-application. Every layer of the stack is designed knowing what the others can do, and every layer evolves with the others. The mental model is: there are no clean abstractions, only joint optimisations.

Why this beats Moore's Law

Moore's Law described a doubling of transistor density roughly every two years. At its peak, once clock-speed gains were added on top of density, that translated to roughly a 10x performance gain every 5 years, or a hundredfold over a decade. Dennard scaling, which let those denser transistors run faster without a rise in power density, broke down in the mid-2000s. Pure semiconductor scaling has not delivered Moore-rate gains in over a decade.

NVIDIA delivered roughly a million-fold speedup on AI workloads from 2015 to 2025. Transistor count per GPU grew too, but by a factor on the order of tens, nowhere near a million. The rest came from codesign: tensor cores, mixed precision (FP16, FP8, FP4), NVLink, NVSwitch, CUDA libraries, transformer-aware kernels, rack-scale interconnects, and model architectures tuned to the hardware they run on.

Source: Jensen Huang, Stanford CS153 Frontier Systems lecture, April 30, 2026 (https://cs153.stanford.edu/)
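To make the gap between the two figures concrete, the short sketch below converts each decade-scale gain quoted above into the constant yearly multiplier that would compound to it. The inputs are the numbers from the text, not measurements made here, and the helper name `annual_factor` is only illustrative.

```python
def annual_factor(total_gain: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_gain over `years`."""
    return total_gain ** (1.0 / years)

# Figures quoted in the text above (not measured here).
density_doubling = annual_factor(2, 2)       # density doubling every two years
moore_peak = annual_factor(100, 10)          # ~100x performance per decade
codesign = annual_factor(1_000_000, 10)      # ~1,000,000x per decade

print(f"density doubling : {density_doubling:.2f}x per year")  # 1.41x
print(f"Moore-era peak   : {moore_peak:.2f}x per year")        # 1.58x
print(f"codesign decade  : {codesign:.2f}x per year")          # 3.98x

# Gain left over even if scaling alone had still delivered a full 100x.
residual = 1_000_000 / 100
print(f"residual factor  : {residual:,.0f}x")                  # 10,000x
```

Read this way, the million-fold claim amounts to roughly a 4x improvement every year for ten years, against about 1.6x per year in the best Moore-era decade.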

The full perimeter

NVIDIA's codesign perimeter now spans: the GPU itself (architecture, tensor cores, precision formats); the CPU (Grace and Vera, designed for AI workloads rather than general cloud workloads); HBM and the memory subsystem; NVLink and NVSwitch; the rack as a single computer (NVL72); inter-rack networking (Quantum-X InfiniBand); the software stack (CUDA, TensorRT, cuDNN); and reference model recipes (Nemotron, BioNeMo, Alpamayo, GR00T).

Each layer is a real product with real customers. Each layer is also a constraint and an opportunity for every other layer. That is the moat: any competitor optimising one layer in isolation is competing against an integrated system that has been jointly optimised for a decade.

The model teaches the system: Nemotron as codesign instrument

Codesign is usually told top-down: design the chip for the workload. The harder and more important direction is the feedback loop, because you cannot design a chip for a workload you do not deeply understand. This is the real reason NVIDIA builds its own frontier models. Bryan Catanzaro, who leads the Nemotron group, says the model's first job is to "help us understand how to build the systems of the future." The model is a probe into the workload, and what it reveals gets designed into the next generation of silicon.

Mixture of experts is the cleanest worked example. In such a model, a small router sends each token to a subset of specialised "expert" sub-networks, and which experts fire depends on the token itself, so it cannot be known before the router runs. Serving it well means many GPUs must read and write each other's memory at very high speed, because a single token's experts can live on different chips at every layer. That requirement is why NVIDIA built NVL72, the rack that fuses 72 GPUs into one high-bandwidth NVLink domain. Catanzaro is blunt about the causality: "If we hadn't been working on understanding AI, we wouldn't have been able to build Blackwell properly." The model workload defined the rack.
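The routing pressure described above can be illustrated with a toy simulation. This is a minimal sketch, not a real MoE layer: the router is replaced by uniform random scores (a learned router would be skewed), experts are assumed to be sharded evenly across GPUs, and every size other than the 72 GPUs of NVL72 is a hypothetical value chosen for illustration.

```python
import random

def route(scores, top_k):
    """Return the indices of the top_k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]

def remote_fraction(n_tokens, n_experts, n_gpus, top_k, seed=0):
    """Fraction of token-to-expert dispatches that leave the token's own GPU."""
    rng = random.Random(seed)
    experts_per_gpu = n_experts // n_gpus  # assumes an even split of experts
    remote = total = 0
    for t in range(n_tokens):
        home_gpu = t % n_gpus  # tokens spread round-robin over GPUs
        # Stand-in for a learned router: uniform random scores per expert.
        scores = [rng.random() for _ in range(n_experts)]
        for expert in route(scores, top_k):
            total += 1
            if expert // experts_per_gpu != home_gpu:  # expert is on another GPU
                remote += 1
    return remote / total

frac = remote_fraction(n_tokens=2000, n_experts=144, n_gpus=72, top_k=8)
print(f"{frac:.1%} of dispatches cross the interconnect")
```

Under these assumptions almost every dispatch is remote (the expected share is 1 - 1/72), which is why the interconnect, rather than any single GPU, becomes the limiting resource.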

The loop then runs back the other way inside a single design cycle. Nemotron 3 introduced "Latent MoE," which down-projects each token's vector before sending it across NVLink and reconstructs it on the far side, cutting interconnect traffic enough to fit roughly 4x the experts at the same inference cost. That is a model-architecture change made to fit the network that the model's own routing behaviour had motivated in the first place. Chip, network, and model are being tuned against each other at the same time, which is exactly the joint optimisation the rest of this note describes, observed live.

Source: Bryan Catanzaro (VP, Applied Deep Learning Research, NVIDIA), The MAD Podcast with Matt Turck, July 2, 2026.
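The Latent MoE trade-off can be checked with back-of-envelope arithmetic. The sketch below assumes that interconnect traffic scales with tokens x experts per token x vector width, that both the dispatch and the return trip travel at the reduced width, and that the cost of the projection matrix multiplies can be ignored. All sizes are hypothetical; only the roughly 4x ratio comes from the text.

```python
def all_to_all_bytes(n_tokens, top_k, width, bytes_per_value=2):
    """Bytes moved to send each token to top_k experts and bring results back."""
    one_way = n_tokens * top_k * width * bytes_per_value
    return 2 * one_way  # dispatch + combine

HIDDEN = 4096   # hypothetical model width
LATENT = 1024   # hypothetical down-projected width (4x smaller)
TOKENS = 8192   # hypothetical number of tokens in flight
TOP_K = 2       # hypothetical experts per token

baseline = all_to_all_bytes(TOKENS, TOP_K, HIDDEN)
latent = all_to_all_bytes(TOKENS, TOP_K, LATENT)
# Spend the saving on more experts per token instead of less traffic.
latent_more_experts = all_to_all_bytes(TOKENS, TOP_K * (HIDDEN // LATENT), LATENT)

print(f"baseline          : {baseline / 2**20:.0f} MiB")             # 256 MiB
print(f"latent, same top-k: {latent / 2**20:.0f} MiB")               # 64 MiB
print(f"latent, 4x top-k  : {latent_more_experts / 2**20:.0f} MiB")  # 256 MiB
```

With a 4x narrower vector on the wire, four times as many expert dispatches fit in the same traffic budget, which is one way to read the "roughly 4x the experts at the same inference cost" claim.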

What this means for the rest of the field

AMD's MI series and ROCm software stack are an attempt to do codesign at smaller scope and faster cadence. Google's TPU + JAX + XLA stack is the same idea, built primarily around Google's own workloads. Cerebras, Groq, and other specialised inference players push codesign in the opposite direction: less general, more optimised for a single workload shape.

The pattern across the field is the same: pure chip vendors are losing to integrated system vendors. The economic value moves to whoever can guarantee the joint behaviour of the entire stack.