Compute · Control plane

Why is the CPU:GPU ratio rising in AI infrastructure?

AI training was a GPU story. Production inference, reinforcement learning, and agentic workflows are CPU stories layered on top of GPU kernels. Capacity planning has to model both.

Where the binding constraint sits today

A frontier cluster idles its expensive GPUs if the CPUs that feed them cannot keep up. As inference and RL grow faster than training, the CPU layer moves from a procurement detail to a first-class driver of cost, throughput, and energy per token.
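The idling effect can be made concrete with a simple bottleneck model. This is a minimal sketch, not a method taken from the Intel paper. The function names and the example rates are illustrative assumptions, and the model treats the host and the GPU as a two-stage pipeline whose throughput is set by the slower stage.

```python
def gpu_utilisation(cpu_feed_rate: float, gpu_consume_rate: float) -> float:
    """Fraction of time the GPU is busy.

    cpu_feed_rate:    batches/s the host CPUs can prepare (assumed input).
    gpu_consume_rate: batches/s the GPU could process if never starved.
    """
    if cpu_feed_rate <= 0 or gpu_consume_rate <= 0:
        raise ValueError("rates must be positive")
    return min(1.0, cpu_feed_rate / gpu_consume_rate)


def gpu_cost_per_batch(gpu_hourly_cost: float,
                       cpu_feed_rate: float,
                       gpu_consume_rate: float) -> float:
    """GPU rental cost attributed to each delivered batch.

    Delivered throughput is capped by the slower of the two stages,
    so a starved GPU costs more per unit of useful work.
    """
    if cpu_feed_rate <= 0 or gpu_consume_rate <= 0:
        raise ValueError("rates must be positive")
    delivered = min(cpu_feed_rate, gpu_consume_rate)  # batches per second
    return gpu_hourly_cost / (delivered * 3600.0)
```

Under these assumptions, a host that supplies 50 batches/s to a GPU capable of 100 batches/s leaves the GPU half idle and doubles the GPU cost per batch.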

The three pillars need a control plane

Storage, networking, and calculation are necessary but not sufficient. There is a fourth job on every accelerator cluster, and it sits on the CPUs: orchestration. That means loading data, batching requests, paging the KV cache, routing between models, scheduling jobs, and recovering from faults. None of this is tensor math, but all of it gates how often the tensor math runs.

Industry conversation increasingly treats CPUs as the air-traffic controller for the GPU fleet. The framing comes from an Intel white paper on the rising CPU:GPU ratio in AI infrastructure (Varra, Ashtikar, Lal, Krishnapura, 2026), and it is useful even with the Intel commercial interest noted up front.

Source: Intel, "The Rising CPU:GPU Ratio in AI Infrastructure: Drivers, Trends, and Implications", 2026

Inference is overtaking training

The historic split between AI training and AI inference spend is reversing. At CES 2026, Lenovo CEO Yuanqing Yang described a shift from roughly 80 percent training and 20 percent inference toward 80 percent inference and 20 percent training. Deloitte estimates inference at half of AI compute in 2025, rising to two-thirds in 2026. Model API spending grew from 3.5 billion to 8.4 billion USD in 2025, per Menlo Ventures.

Inference is structurally more CPU-heavy than training. Training is dominated by dense matrix multiplications on large datasets, where the accelerator's arithmetic throughput is the limit. Inference is a stream of small kernel calls wrapped in business logic, request routing, tokenisation, retrieval, and result formatting. The work outside the GPU starts to set the latency floor and the throughput ceiling.
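One way to see why the host-side work becomes the limit is an Amdahl-style calculation. The sketch below is illustrative: the function names and the millisecond figures are assumptions, and it assumes the CPU and GPU portions of a request run serially with no overlap.

```python
def end_to_end_speedup(cpu_ms: float, gpu_ms: float, gpu_speedup: float) -> float:
    """Overall request speedup when only the GPU portion gets faster.

    cpu_ms and gpu_ms are the host and accelerator time per request
    (assumed serial, no overlap); gpu_speedup is the GPU-only factor.
    """
    if cpu_ms < 0 or gpu_ms < 0 or gpu_speedup <= 0:
        raise ValueError("times must be non-negative and speedup positive")
    before = cpu_ms + gpu_ms
    if before == 0:
        raise ValueError("request must take some time")
    after = cpu_ms + gpu_ms / gpu_speedup
    return before / after


def speedup_ceiling(cpu_ms: float, gpu_ms: float) -> float:
    """Best possible speedup if the GPU portion took zero time."""
    if cpu_ms < 0 or gpu_ms < 0:
        raise ValueError("times must be non-negative")
    if cpu_ms == 0:
        return float("inf")
    return (cpu_ms + gpu_ms) / cpu_ms
```

With a hypothetical request that spends 20 ms on the host and 80 ms on the GPU, a 4x faster GPU yields only a 2.5x faster request, and no GPU upgrade can exceed 5x.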

50% → 67%

Deloitte estimate of inference share of AI compute, 2025 to 2026

$3.5B → $8.4B

Model API spending growth in 2025, per Menlo Ventures

~9%

AMD-reported GPU inference throughput lift from faster host CPUs

Source: Intel paper citations: Deloitte, Lenovo CES 2026, Menlo Ventures, AMD.com 2025

Agentic AI multiplies the orchestration load

Agentic workflows do not look like a single inference call. They look like a planner LLM, a tool router, a code executor in a sandbox, a verification agent, and a retry loop. Each step has a small GPU footprint and a large CPU footprint of state management, tool dispatch, and side-effect handling.
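The loop described above can be sketched as a toy harness that counts model calls separately from harness steps. Everything here is hypothetical: `fake_planner` stands in for a planner LLM, the tool registry holds a single arithmetic tool, and verification is reduced to a type check rather than a separate agent. The counts are step counts, not measured CPU or GPU time.

```python
from dataclasses import dataclass


@dataclass
class StepLog:
    gpu_calls: int = 0  # model invocations, which would run on accelerators
    cpu_steps: int = 0  # harness work: routing, execution, checks, fault handling


# Hypothetical tool registry; a real harness dispatches to sandboxes, APIs, files.
TOOLS = {"add": lambda a, b: a + b}


def fake_planner(task: str, attempt: int) -> dict:
    """Stand-in for a planner LLM call (hypothetical and deterministic).

    The first attempt returns a malformed argument so the retry path runs.
    """
    args = (2, "3") if attempt == 0 else (2, 3)
    return {"tool": "add", "args": args}


def run_agent(task: str, max_attempts: int = 3):
    """Plan, route, execute, verify, and retry on failure."""
    log = StepLog()
    for attempt in range(max_attempts):
        plan = fake_planner(task, attempt)
        log.gpu_calls += 1      # planning is the only model call in this sketch
        tool = TOOLS[plan["tool"]]
        log.cpu_steps += 1      # tool routing
        log.cpu_steps += 1      # tool execution (counted even if it fails)
        try:
            result = tool(*plan["args"])
        except TypeError:
            log.cpu_steps += 1  # fault handling before the retry
            continue
        log.cpu_steps += 1      # verification of the result
        if isinstance(result, int):
            return result, log
    raise RuntimeError("agent did not converge")
```

In this toy run, one failed attempt and one successful attempt produce two model calls and six harness steps, which is the shape of the imbalance the paragraph describes.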

The Intel paper argues that breaking complex tasks into smaller structured subtasks plus code execution often beats invoking a single frontier model. The bargain is more CPU work in exchange for less GPU spend. Benchmark data in the paper, drawn from "Executable Code Actions Elicit Better LLM Agents" (Wang et al., 2024), shows code-as-action outperforming JSON-as-action and text-as-action across most current models.

Source: Intel paper Figures 1 and 2, citing Wang et al. and the Manus AI architecture paper

The Vera CPU: Hardware built for the Agentic Harness

NVIDIA’s custom Arm-based Vera CPU is built specifically to address the agentic orchestration bottleneck. By NVIDIA’s account, it opens a new $200 billion total addressable market (TAM) and secures roughly $20 billion in revenue visibility in its first year. Co-designed from the ground up with NVLink and the Rubin GPU family, Vera shifts the industry paradigm from renting virtual machine cores ("dollars per core") to optimizing token execution ("tokens per dollar").

As Jensen Huang detailed in May 2026, agentic AI operates on a clear division of labor: the outer agentic orchestration harness (state management, Python execution, file I/O, tool calls) runs on CPUs, while the internal reasoning loops run on GPUs. When subagents are spun off recursively, they return to the GPU to "think". NVIDIA quotes 1.5x core performance, 2x performance per watt, and 4x rack density vs. x86, and positions Vera as the way to keep the CPU harness from starving the GPU engine.

Vera also changes how its own system memory is provisioned: it feeds from eight socketed SOCAMM modules rather than soldered LPDDR, so the harness's memory budget (768 GB to 1.5 TB per CPU) becomes a per-deployment choice rather than a spec fixed at manufacture.

1.5x

Vera core performance increase vs. x86 alternatives

4x

Rack density multiplier of Vera CPU vs. traditional x86

$20B

NVIDIA CPU revenue visibility for the current fiscal year

Source: NVIDIA Q1 2027 Earnings Call, May 2026

The data pipeline is mostly CPU

Even before a request reaches a GPU, a chain of preparation steps runs on host CPUs. The Intel paper offers an estimated CPU-bound share for each pipeline stage, based on production AI pipeline experience. The numbers are worth keeping nearby when modelling cluster utilisation.

Data ingestion: 85 to 95 percent CPU-bound. Storage reads, decompression, stream buffering.
Cleaning: 90 to 100 percent CPU-bound. Filtering, deduplication, validation.
Transformation: 70 to 85 percent CPU-bound. Resizing, normalisation, tokenisation.
Augmentation: 60 to 80 percent CPU-bound. Random crops, rotations, text perturbations.
Batching: 95 to 100 percent CPU-bound. Collation, padding, dimensional alignment.
Format conversion: 80 to 90 percent CPU-bound. PyTorch and TensorFlow tensor creation.

Source: Intel paper Table 2
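For utilisation modelling, the per-stage ranges above can be folded into a single time-weighted CPU-bound share. This is a minimal sketch: the ranges are the ones listed above, while the function name and the per-stage durations passed in are assumptions that a planner would replace with measured values.

```python
from typing import Mapping, Tuple

# (low, high) CPU-bound share per stage, as listed in the stage breakdown above.
CPU_BOUND_RANGE = {
    "ingestion": (0.85, 0.95),
    "cleaning": (0.90, 1.00),
    "transformation": (0.70, 0.85),
    "augmentation": (0.60, 0.80),
    "batching": (0.95, 1.00),
    "format_conversion": (0.80, 0.90),
}


def pipeline_cpu_share(stage_seconds: Mapping[str, float]) -> Tuple[float, float]:
    """Time-weighted (low, high) CPU-bound share for a whole pipeline.

    stage_seconds maps a stage name to the wall-clock time spent in it.
    The durations are inputs the caller must measure; none come from the paper.
    """
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("total pipeline time must be positive")
    low = sum(t * CPU_BOUND_RANGE[s][0] for s, t in stage_seconds.items()) / total
    high = sum(t * CPU_BOUND_RANGE[s][1] for s, t in stage_seconds.items()) / total
    return low, high
```

If every stage took equal time, the pipeline as a whole would land at roughly 80 to 92 percent CPU-bound. Real pipelines will skew toward whichever stages dominate their wall-clock time.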

Reinforcement learning is the second driver

Reinforcement learning has moved from games to robotics, autonomous vehicles, industrial control, finance, and LLM alignment. The architectural pattern is consistent: a learner runs on GPUs, but thousands of environment workers run on CPUs producing the rollouts the learner trains on.

IMPALA, Ray RLlib, AlphaZero, MuZero, and large-scale PPO all share this actor-on-CPU plus learner-on-GPU structure. As the environments get richer (high-fidelity physics, multi-agent driving, multi-sensor robotics), the CPU side of the cluster grows faster than the GPU side.
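The scaling pressure on the CPU side follows from simple rate matching between actors and learner. The sketch below is illustrative only: the function name, the headroom parameter, and the example rates are assumptions, and it assumes one environment worker per CPU core with perfectly parallel rollouts.

```python
import math


def env_workers_needed(learner_steps_per_s: float,
                       worker_steps_per_s: float,
                       headroom: float = 1.0) -> int:
    """CPU environment workers needed to keep a GPU learner supplied.

    learner_steps_per_s: environment steps/s the learner can consume.
    worker_steps_per_s:  environment steps/s one CPU worker can produce.
    headroom:            multiplier for stragglers and restarts (assumed knob).
    """
    if learner_steps_per_s <= 0 or worker_steps_per_s <= 0 or headroom <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(learner_steps_per_s * headroom / worker_steps_per_s)
```

With a hypothetical learner consuming 50,000 steps/s, workers that each produce 200 steps/s imply 250 workers. If a richer simulator cuts each worker to 50 steps/s, the same learner needs 1,000 workers, so CPU demand grows fourfold while the GPU side is unchanged.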

RLHF compounds the trend. Every frontier lab now uses some form of preference-based reward modelling, and the orchestration, sampling, and reward evaluation infrastructure lives on CPUs.

Source: Intel paper sections on RL training architectures

The Intel-authored caveat

The paper is Intel-authored, which matters. Intel sells CPUs and benefits commercially from the narrative that CPU growth is structurally underestimated. That is worth flagging before treating the numbers as neutral.

The directional claim, however, lines up with non-Intel sources cited inside the paper itself (Deloitte, McKinsey, Lenovo, AMD, Menlo Ventures, Futurum). The right read is that the CPU-growth story is real, the magnitude is contested, and the strongest commercial advocate happens to be the company that builds CPUs.

Strategic read

Treat CPU planning as a first-class variable in AI capacity models. A cluster designed for last-cycle training ratios (often 8 to 32 cores per accelerator) will underperform on agentic inference and RL workloads (now trending toward 64 to 128 cores per accelerator).
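The gap between the two ratio ranges is easy to put into core counts. This is a rough sketch using the cores-per-accelerator ranges quoted above. The helper names and the 1,024-accelerator cluster size are hypothetical.

```python
from typing import Tuple

# Cores-per-accelerator ranges quoted in the paragraph above.
TRAINING_ERA_RATIO = (8, 32)
AGENTIC_RL_RATIO = (64, 128)


def host_core_range(accelerators: int, ratio: Tuple[int, int]) -> Tuple[int, int]:
    """Total host cores implied by a (low, high) cores-per-accelerator range."""
    if accelerators < 0:
        raise ValueError("accelerator count must be non-negative")
    low, high = ratio
    return accelerators * low, accelerators * high


def core_shortfall(accelerators: int, provisioned_ratio: int, required_ratio: int) -> int:
    """Extra host cores needed to move a cluster from one ratio to another."""
    if accelerators < 0:
        raise ValueError("accelerator count must be non-negative")
    return max(0, accelerators * (required_ratio - provisioned_ratio))
```

For a hypothetical 1,024-accelerator cluster, the training-era range implies 8,192 to 32,768 host cores, while the agentic and RL range implies 65,536 to 131,072. A cluster built at 32 cores per accelerator would need another 32,768 cores just to reach the low end of the newer range.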

The deeper read for Pere: the same way HBM bandwidth quietly sets the price of a token, CPU orchestration quietly sets the price of an agent step. Both belong in the same mental model. Neither is the headline number on a chip datasheet.
