
Policy Optimization (RLHF & DPO)

Pretraining learns from examples; reinforcement learning learns from outcomes. It is the loop behind RLHF, reasoning models, and agents.

Where the binding constraint sits today

RL is how a model improves from its own attempts rather than from labeled data. The scarce inputs are reward signals that cannot be gamed and the rollout-and-verify infrastructure to generate them.

Learning from outcomes, not examples

In supervised learning—and in pretraining—a model is shown the right answer and nudged to reproduce it. Reinforcement learning removes the answer key. The model takes an action, receives a single number back—a reward—that says how good the outcome was, and adjusts to earn more reward next time. Nobody tells it the correct move; it discovers which behaviors pay off by trying them.

The vocabulary is small. A policy (the model) acts in an environment, collects a reward, and updates toward the actions that raised its cumulative reward. It is the same frame behind a dog learning a trick, AlphaGo learning to play Go, and a chat model learning to be helpful—only the environment and the reward change.
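The policy, environment, reward loop can be made concrete with a toy example. The sketch below is an illustration, not how a language model is trained: it assumes a hypothetical three-armed bandit as the environment, a softmax over three logits as the policy, and a plain REINFORCE update with a running-average baseline. The names train_bandit, reward_probs, and the hyperparameters are all made up for this example.

```python
import math
import random

def softmax(logits):
    """Turn raw preferences into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(reward_probs, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a toy bandit (hypothetical environment).

    reward_probs[i] is the chance that action i pays a reward of 1.
    The policy never sees these numbers; it only sees rewards.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(reward_probs)  # the policy's parameters
    baseline = 0.0                      # running average of past rewards
    for _ in range(steps):
        probs = softmax(logits)
        # Act: sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Environment: return a single number, the reward.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        # Update: raise the probability of actions that beat the baseline.
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        for i in range(len(logits)):
            # d log pi(action) / d logit_i = onehot(action)_i - probs_i
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)

final_probs = train_bandit([0.2, 0.5, 0.9])
```

Nothing in the loop names the best action. The policy shifts probability toward whichever action has paid off more often than average, which is the sense in which it learns from outcomes rather than from an answer key.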

Why language models suddenly need it

A pretrained base model is brilliant autocomplete: it has read the world but has no idea which of its plausible continuations is actually useful, correct, or safe. Post-training is where RL enters. Instead of more text to imitate, the model is scored on its own outputs and pushed toward the ones that earn reward.

That is the shift from "predict the next token" to "produce the outcome that gets rewarded"—and it is where most of the behavior users actually feel is installed: helpfulness, refusal, calibrated reasoning, knowing when to call a tool. The base model supplies raw capability; RL decides how that capability behaves.

Decoding the acronym zoo

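The methods behind these acronyms share one structure: generate attempts, score them, and push the model toward the high-scoring ones. The first three differ only in where the score comes from: a human, a verifier, or another model. The last entry names the optimizers that apply the update.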

RLHF — RL from Human Feedback. People rank model outputs; those rankings train a reward model; the policy is optimized against it. This is what turned raw GPT-style models into usable assistants.
RLVR — RL from Verifiable Rewards. Where an answer can be checked—math, code, unit tests—the reward comes from a verifier instead of a human. The signal is crisp and hard to fake, which is why it is the engine behind reasoning models.
RLAIF — RL from AI Feedback. Another model supplies the preference judgments, scaling past the throughput of human labelers (the Constitutional-AI approach).
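PPO / GRPO. The optimization algorithms that actually move the weights. PPO (Proximal Policy Optimization) is the long-time workhorse; GRPO (Group Relative Policy Optimization, from DeepSeek) is a lighter variant that drops the separate value network and instead scores each response against the other responses sampled for the same prompt.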
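Two pieces of the list above are small enough to sketch directly. The first function is the pairwise (Bradley-Terry style) loss commonly used to fit a reward model to human rankings. The second is a group-relative advantage in the spirit of GRPO, where the baseline is the mean reward of the responses sampled for one prompt rather than the output of a value network. Both are simplified illustrations: the function names are hypothetical, the use of a population standard deviation and the eps constant are assumptions, and real implementations work on batched tensors and add clipping and KL terms that are omitted here.

```python
import math
from statistics import mean, pstdev

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably.

    Small when the reward model scores the human-preferred response
    higher than the rejected one, large when it gets the order wrong.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to its own sampling group.

    rewards: one reward per response sampled for the same prompt.
    Returns (reward - group mean) / (group std + eps) per response.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt; a verifier marked two as correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this example the two correct answers receive a positive advantage and the two incorrect ones a negative advantage of the same size. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.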

Why RL is an infrastructure story

RL training does more than forward and backward passes. It has to generate (roll out) many candidate responses, run the environment or verifier to score each one, and only then learn. Sampling itself runs on accelerators, but tool calls, sandboxed execution, grading, and the orchestration that ties them together are often CPU-heavy. That is part of why the CPU:GPU ratio is rising: the GPU still does the matmul, but a growing share of wall-clock time goes to rollouts and verification.
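The three phases can be laid out as a skeleton. This is a sketch under heavy assumptions: generate_candidates stands in for model sampling, verify stands in for a real grader, the prompts are toy additions such as "2+3", and the learn phase is a placeholder that only assembles the training batch. All function names are hypothetical. What it shows is the shape of one step and where per-phase timing would be measured.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_candidates(prompt, n):
    """Stand-in for sampling n responses from the policy (hypothetical).

    For a prompt like "2+3" it returns n guesses around the true sum,
    exactly one of which is correct.
    """
    a, b = (int(x) for x in prompt.split("+"))
    return [str(a + b + offset) for offset in range(-(n // 2), n - n // 2)]

def verify(prompt, response):
    """Stand-in verifier: reward 1.0 only for the correct sum."""
    a, b = (int(x) for x in prompt.split("+"))
    return 1.0 if int(response) == a + b else 0.0

def rl_step(prompts, samples_per_prompt=4):
    """One rollout -> verify -> learn step, with per-phase wall-clock time."""
    timings = {}

    start = time.perf_counter()
    rollouts = [(p, r) for p in prompts
                for r in generate_candidates(p, samples_per_prompt)]
    timings["rollout"] = time.perf_counter() - start

    start = time.perf_counter()
    # Grading runs off the accelerator; here it is fanned out over threads.
    with ThreadPoolExecutor(max_workers=4) as pool:
        rewards = list(pool.map(lambda pr: verify(*pr), rollouts))
    timings["verify"] = time.perf_counter() - start

    start = time.perf_counter()
    # Placeholder for the gradient update: just assemble the batch.
    batch = [(p, r, rew) for (p, r), rew in zip(rollouts, rewards)]
    timings["learn"] = time.perf_counter() - start

    return batch, timings

batch, timings = rl_step(["2+3", "10+7"])
```

In a real system the rollout phase is model inference, the verify phase may launch sandboxes or test suites, and only the learn phase is the familiar forward and backward pass. The timings dictionary is where the balance between them would show up.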

The scarce input becomes reward quality. A reward the model can game teaches exactly the wrong lesson. This failure is called reward hacking: for example, a model learns to write verbose but incorrect answers because they look helpful to the grader. A verifier you can trust is therefore worth more than another trillion tokens.
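A toy comparison illustrates the difference between a gameable reward and a checked one. Both scoring functions below are invented for this example: proxy_reward pays for length and confident-sounding words, while verifier_reward checks the last number in the answer against a known result. Neither is taken from a real system, and the verifier is deliberately naive.

```python
import re

CONFIDENT_WORDS = {"certainly", "thorough", "detailed"}

def proxy_reward(answer):
    """Gameable proxy: longer, more confident-sounding text scores higher."""
    words = answer.lower().split()
    confident = sum(w.strip(".,!:") in CONFIDENT_WORDS for w in words)
    return 0.1 * len(words) + confident

def verifier_reward(answer, expected):
    """Checked reward: 1.0 only if the last number in the answer is right."""
    numbers = re.findall(r"-?\d+", answer)
    return 1.0 if numbers and int(numbers[-1]) == expected else 0.0

# Question: what is 7 + 5?
concise_right = "12"
verbose_wrong = ("Certainly! Here is a thorough, detailed walkthrough "
                 "of the sum: the result is 13")

proxy_prefers_wrong = proxy_reward(verbose_wrong) > proxy_reward(concise_right)
verifier_prefers_right = (verifier_reward(concise_right, 12)
                          > verifier_reward(verbose_wrong, 12))
```

Under the proxy, the padded wrong answer outscores the correct one, so optimizing against it would train the model to pad. The verifier reverses the ranking, although a check this simple could itself be gamed in other ways, which is why building trustworthy verifiers is the hard part.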