What is reinforcement learning?
Pretraining learns from examples; reinforcement learning learns from outcomes. It is the loop behind RLHF, reasoning models, and agents — and increasingly where frontier progress and compute spend are going.
RL is how a model improves from its own attempts rather than from labeled data. The scarce inputs are reward signals that cannot be gamed and the rollout-and-verify infrastructure to generate them — which is why RL is as much an infrastructure story as a modeling one.
Learning from outcomes, not examples
In supervised learning — and in pretraining — a model is shown the right answer and nudged to reproduce it. Reinforcement learning removes the answer key. The model takes an action, receives a single number back — a reward — that says how good the outcome was, and adjusts to earn more reward next time. Nobody tells it the correct move; it discovers which behaviors pay off by trying them.
The vocabulary is small. A policy (the model) acts in an environment, collects a reward, and updates toward the actions that raised its cumulative reward. It is the same frame behind a dog learning a trick, AlphaGo learning to play Go, and a chat model learning to be helpful — only the environment and the reward change.
Why language models suddenly need it
A pretrained base model is brilliant autocomplete: it has read the world but has no idea which of its plausible continuations is actually useful, correct, or safe. Post-training is where RL enters. Instead of more text to imitate, the model is scored on its own outputs and pushed toward the ones that earn reward.
That is the shift from "predict the next token" to "produce the outcome that gets rewarded" — and it is where most of the behavior users actually feel is installed: helpfulness, refusal, calibrated reasoning, knowing when to call a tool. The base model supplies raw capability; RL decides how that capability behaves.
Decoding the acronym zoo
They all share one structure: generate attempts, score them, push the model toward the high-scoring ones. They differ only in where the score comes from — a human, a verifier, or another model.
- RLHF — RL from Human Feedback. People rank model outputs; those rankings train a reward model; the policy is optimized against it. This is what turned raw GPT-style models into usable assistants.
- RLVR — RL from Verifiable Rewards. Where an answer can be checked — math, code, unit tests — the reward comes from a verifier instead of a human. The signal is crisp and hard to fake, which is why it is the engine behind reasoning models.
- RLAIF — RL from AI Feedback. Another model supplies the preference judgments, scaling past the throughput of human labelers (the Constitutional-AI approach).
- PPO / GRPO. The optimization algorithms that actually move the weights. PPO is the long-time workhorse; GRPO (from DeepSeek) is a lighter variant that drops the separate value network.
Why RL is an infrastructure story
RL training does more than forward and backward passes. It has to generate — roll out — many candidate responses, run the environment or verifier to score each one, and only then learn. Generation, tool calls, and grading are often CPU- and orchestration-heavy, which is part of why the CPU:GPU ratio is rising: the GPU still does the matmul, but a growing share of wall-clock time is rollouts and verification.
The scarce input becomes reward quality. A reward the model can game — "reward hacking" — teaches exactly the wrong lesson, so a verifier you can trust is worth more than another trillion tokens. The RL era turns the bottleneck from data volume into signal integrity.
Strategic read
RL is where the frontier is moving. Reasoning models, agents, and long-horizon tasks are all RL-shaped: the model must act, see what happens, and improve. Pretraining buys raw capability; RL is increasingly what converts that capability into an edge.
For Pere, the read is that the binding constraint of the RL era is not parameters but verifiable reward and rollout infrastructure — the evaluation-and-data race on the model side, and a compute mix that favors more CPU per GPU on the infrastructure side. It also underwrites the recursive-self-improvement thesis: a model that can generate and grade its own attempts has a path to bootstrap.