Building agents · 5 of 7

Evals — how do you know it's good?

A demo proves the agent can succeed once. An eval proves the agent succeeds reliably, at known rates, on known kinds of inputs. A common reason agent projects fail is that the team did not invest in evals early enough. Evals are the engineering discipline that separates an agent that worked in a demo from an agent that works in production.

Where the binding constraint sits today

What do you evaluate, at what granularity, with what ground truth, and how often? These four questions have no single right answer — they depend on the job. The framework below is how serious agent teams decide.

What to evaluate — outcomes, not behaviors

The first design choice is what success even means. For a workflow tool, this is easy: success is each step running without error. For an agent, the steps vary across executions, so step-by-step matching is useless. The right primary metric for an agent is whether the goal was accomplished, not whether the agent did things in the expected order.

This sounds obvious and is constantly violated. Teams build agent evals that check for specific tool calls, specific intermediate outputs, specific phrasings. Those evals fail the moment the agent finds a different correct path. Outcome-based evals — did the email get sent, did the ticket get resolved, did the calendar event end up where it was supposed to — are the only kind that scale with agent capability.
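The difference between the two styles of check can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `Execution` record, the tool names, and the state keys are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    """One agent run: the tool calls it made and the state it left behind."""
    tool_calls: list[str]
    final_state: dict[str, object] = field(default_factory=dict)


def trajectory_matches(run: Execution, expected_calls: list[str]) -> bool:
    # Brittle: passes only if the agent took exactly the expected path.
    return run.tool_calls == expected_calls


def outcome_achieved(run: Execution, expected_state: dict[str, object]) -> bool:
    # Outcome-based: passes if every expected fact holds in the final state,
    # regardless of which tool calls produced it.
    return all(run.final_state.get(k) == v for k, v in expected_state.items())


# Hypothetical scenario: the email should end up sent to the right person.
expected_calls = ["lookup_contact", "draft_email", "send_email"]
expected_state = {"email_sent": True, "recipient": "dana@example.com"}

# The agent skipped the lookup because the address was already in context.
run = Execution(
    tool_calls=["draft_email", "send_email"],
    final_state={"email_sent": True, "recipient": "dana@example.com"},
)

print(trajectory_matches(run, expected_calls))  # False
print(outcome_achieved(run, expected_state))    # True
```

The trajectory check fails a run that did the job by a shorter route; the outcome check passes it.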

Secondary metrics matter and are layered on top. Cost per execution. Steps per execution. Tool-call accuracy. Latency. Each of these should have its own threshold, separately tracked. But the primary metric is always outcome.
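One way to track each metric against its own threshold is sketched below. The field names and every threshold value are assumptions chosen for illustration; real limits depend on the job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    outcome_pass_rate: float   # primary metric
    cost_usd_per_run: float
    steps_per_run: float
    tool_call_accuracy: float
    p95_latency_s: float


# Hypothetical thresholds: ("min", x) means the value must be >= x,
# ("max", x) means the value must be <= x.
THRESHOLDS = {
    "outcome_pass_rate": ("min", 0.90),
    "cost_usd_per_run": ("max", 0.25),
    "steps_per_run": ("max", 12.0),
    "tool_call_accuracy": ("min", 0.95),
    "p95_latency_s": ("max", 30.0),
}


def check_thresholds(metrics: RunMetrics, thresholds: dict) -> dict[str, bool]:
    """Return a per-metric pass/fail map so each metric is reported separately."""
    report = {}
    for name, (kind, limit) in thresholds.items():
        value = getattr(metrics, name)
        report[name] = value >= limit if kind == "min" else value <= limit
    return report


metrics = RunMetrics(
    outcome_pass_rate=0.93,
    cost_usd_per_run=0.31,
    steps_per_run=9.0,
    tool_call_accuracy=0.97,
    p95_latency_s=22.0,
)
report = check_thresholds(metrics, THRESHOLDS)
breaches = [name for name, ok in report.items() if not ok]
print(breaches)  # ['cost_usd_per_run']
```

Keeping the results as a map rather than a single combined score makes it visible that the outcome is fine while cost has drifted over budget.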

The golden set

A golden set is a curated collection of input scenarios paired with the correct outcome the agent should reach. It is the spine of any serious eval system. Twenty to two hundred scenarios is the typical range. Each scenario captures a real input the agent will see (or a representative synthetic equivalent) plus the expected outcome.
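A golden set can be stored as plain data that domain experts edit and engineers load. The sketch below assumes a JSON Lines file with one scenario per line; the field names (`scenario_id`, `user_input`, `expected_outcome`, `source`, `tags`) are illustrative, not a standard schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    user_input: str
    expected_outcome: dict          # facts that must hold after the run
    source: str = "production"      # or "synthetic"
    tags: tuple[str, ...] = ()


def load_golden_set(jsonl_text: str) -> list[Scenario]:
    """Parse one scenario per non-empty line and reject duplicate ids."""
    scenarios: list[Scenario] = []
    seen: set[str] = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        raw = json.loads(line)
        if raw["scenario_id"] in seen:
            raise ValueError(f"duplicate scenario_id: {raw['scenario_id']}")
        seen.add(raw["scenario_id"])
        scenarios.append(
            Scenario(
                scenario_id=raw["scenario_id"],
                user_input=raw["user_input"],
                expected_outcome=raw["expected_outcome"],
                source=raw.get("source", "production"),
                tags=tuple(raw.get("tags", [])),
            )
        )
    return scenarios


# Two made-up scenarios for illustration.
GOLDEN_JSONL = """
{"scenario_id": "refund-01", "user_input": "Refund order 1001", "expected_outcome": {"refund_issued": true}, "tags": ["refund"]}
{"scenario_id": "sched-01", "user_input": "Move my 3pm to Friday", "expected_outcome": {"event_day": "Friday"}, "source": "synthetic"}
"""

golden_set = load_golden_set(GOLDEN_JSONL)
print(len(golden_set), golden_set[1].source)  # 2 synthetic
```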

The golden set is built by the team that knows the domain — usually a product manager, a domain expert, or a customer-facing employee, not an engineer. The engineering effort is in running the agent against the set and grading the outcomes. The domain effort is in curating the scenarios so that they actually cover the cases the agent will see in production.

Building a golden set that reflects production is the hardest unsexy work in agent development. Teams that skip it cannot tell whether a change improved or degraded the agent. Teams that build it badly end up with an agent that performs well on the set and badly on real users. The golden set is the eval system; everything else is plumbing around it.

20–200: scenarios in a typical production golden set

1–2 weeks: domain-expert time to build the initial set for a focused agent

Offline evals vs online evals

Offline evals run the agent against the golden set in batch, on demand, and produce a pass-rate plus diagnostic information. They are fast feedback for engineering. Every change to the agent — new prompt, new tool, new model — gets run against the set before shipping. This is the standard regression-test pattern adapted for agents.
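A minimal sketch of that regression loop follows, assuming the agent can be called as a function from input text to a final-state dict. Both toy agents and the three scenarios are hypothetical stand-ins.

```python
from typing import Callable

Agent = Callable[[str], dict]
GoldenSet = list[tuple[str, str, dict]]  # (scenario_id, input, expected outcome)


def run_offline_eval(agent: Agent, golden_set: GoldenSet) -> dict[str, bool]:
    """Run every scenario and grade it on outcome; a crash counts as a failure."""
    results = {}
    for scenario_id, user_input, expected in golden_set:
        try:
            final_state = agent(user_input)
        except Exception:
            results[scenario_id] = False
            continue
        results[scenario_id] = all(
            final_state.get(k) == v for k, v in expected.items()
        )
    return results


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


def regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Scenarios that passed before the change and fail after it."""
    return sorted(
        sid for sid, passed in baseline.items()
        if passed and not candidate.get(sid, False)
    )


golden_set = [
    ("refund-01", "refund order 1001", {"refund_issued": True}),
    ("cancel-01", "cancel order 1002", {"order_cancelled": True}),
    ("status-01", "where is order 1003", {"status_reported": True}),
]


def baseline_agent(text: str) -> dict:
    if "refund" in text:
        return {"refund_issued": True}
    if "cancel" in text:
        return {"order_cancelled": True}
    return {}


def candidate_agent(text: str) -> dict:
    if "refund" in text:
        return {"refund_issued": True}
    if "where" in text:
        return {"status_reported": True}
    return {}


before = run_offline_eval(baseline_agent, golden_set)
after = run_offline_eval(candidate_agent, golden_set)
print(round(pass_rate(before), 2), round(pass_rate(after), 2))  # 0.67 0.67
print(regressions(before, after))  # ['cancel-01']
```

The two versions have the same pass rate, yet the candidate broke a scenario that used to pass. That is why the per-scenario diff is worth reporting alongside the aggregate.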

Online evals run on real production traffic, in real time or near-real time. They measure what the agent actually did for real users. The signal is more authoritative than offline because it covers the real distribution, but it is slower (you have to wait for users to do things) and noisier (you cannot compare the same input across model versions).

Production agents need both. Offline evals catch regressions on known cases. Online evals catch drift on unknown cases — new types of input the team did not anticipate, edge cases the golden set did not cover. The healthy ratio of engineering attention is roughly two-thirds offline, one-third online; the online effort is mostly in instrumentation rather than in scenario design.

LLM-as-judge and its biases

Once evals go beyond simple pass/fail, the question becomes how to grade outcomes. Some outcomes are checkable programmatically (did the ticket close? was the email sent?). Others are subjective (was the email good? was the answer helpful?). For subjective outcomes, the dominant pattern is LLM-as-judge: a second LLM call evaluates whether the agent's output meets a rubric.

LLM-as-judge has known biases. The judge prefers longer outputs over shorter ones, even when the shorter is better. It prefers outputs that sound confident and professional. It rewards outputs that look like other outputs it has seen, which means style-conformant agents score artificially well. It is sensitive to the order in which it sees options in pairwise comparisons. These biases are not deal-breakers, but they are real and need to be controlled for.

The practical pattern is to use LLM-as-judge for first-pass triage and human review for the cases where the judge is uncertain or where the stakes are high. Pairwise comparisons (give the judge two outputs and ask which is better) are usually more reliable than absolute scoring (give the judge one output and ask if it is good). For high-stakes evaluation, do not delegate grading entirely to another model.
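One common control for order sensitivity is to run each pairwise comparison twice with the positions swapped and escalate to a human when the two verdicts disagree. The sketch below assumes the judge is wrapped as a function returning "first" or "second"; the two stub judges are hypothetical stand-ins for a real model call.

```python
from typing import Callable

# A judge takes (rubric, first_output, second_output) and names the better one.
Judge = Callable[[str, str, str], str]  # returns "first" or "second"


def debiased_pairwise(judge: Judge, rubric: str, output_a: str, output_b: str) -> str:
    """Return "a", "b", or "needs_human" if the verdict flips with the order."""
    forward = judge(rubric, output_a, output_b)   # A shown first
    reverse = judge(rubric, output_b, output_a)   # B shown first
    a_wins_forward = forward == "first"
    a_wins_reverse = reverse == "second"
    if a_wins_forward and a_wins_reverse:
        return "a"
    if not a_wins_forward and not a_wins_reverse:
        return "b"
    return "needs_human"


def position_biased_judge(rubric: str, first: str, second: str) -> str:
    # Stub: always prefers whatever it sees first.
    return "first"


def length_biased_judge(rubric: str, first: str, second: str) -> str:
    # Stub: always prefers the longer output.
    return "first" if len(first) >= len(second) else "second"


rubric = "Is the reply accurate and concise?"
a = "Your refund was issued today."
b = "Thank you so much for reaching out. We have carefully reviewed your request."

print(debiased_pairwise(position_biased_judge, rubric, a, b))  # needs_human
print(debiased_pairwise(length_biased_judge, rubric, a, b))    # b
```

Swapping exposes the order-sensitive judge but not the length-biased one, which picks the longer reply consistently in both orders. The swap controls for one bias only; the others still call for rubric design and human spot checks.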

Granularity — per-step, per-tool-call, or per-task?

An agent execution has a tree of decisions: which tool to call, what parameters, how to interpret the result, what to do next. Each level is potentially eval-able. The question is which level produces useful signal.

Per-tool-call accuracy (was the right tool called with the right parameters?) is useful for debugging but not as the primary success metric, because two different tool sequences can both be correct. Per-step grading (was each reasoning step sound?) is useful for catching localized reasoning errors, but step scores do not aggregate into a measure of outcome quality. Per-task grading (did the overall goal succeed?) is the most closely aligned with what production users care about and is therefore the right primary granularity.

The right composite eval looks like this: per-task outcome is the headline metric. Per-tool-call accuracy and per-step coherence are diagnostic metrics, used to investigate which sub-component of the agent is responsible when the outcome is wrong. They inform engineering decisions but not the ship/no-ship decision.
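A sketch of that split: the ship decision reads only the per-task pass rate, and the diagnostic scores are summarized only over the failed tasks. The record fields, the sample scores, and the 0.9 threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    outcome_ok: bool            # headline: did the overall goal succeed?
    tool_call_accuracy: float   # diagnostic: share of calls with right tool and params
    step_coherence: float       # diagnostic: share of reasoning steps graded sound


def composite_report(results: list[TaskResult], ship_threshold: float) -> dict:
    headline = sum(r.outcome_ok for r in results) / len(results)
    failed = [r for r in results if not r.outcome_ok]
    diagnostics = None
    if failed:
        # Averages over failed tasks only: they point at the likely culprit.
        diagnostics = {
            "tool_call_accuracy": sum(r.tool_call_accuracy for r in failed) / len(failed),
            "step_coherence": sum(r.step_coherence for r in failed) / len(failed),
        }
    return {
        "headline_pass_rate": headline,
        "ship": headline >= ship_threshold,   # diagnostics never enter this decision
        "failed_task_ids": [r.task_id for r in failed],
        "diagnostics_on_failures": diagnostics,
    }


results = [
    TaskResult("t1", True, 1.0, 1.0),
    TaskResult("t2", True, 0.5, 1.0),   # unexpected tool path, correct outcome
    TaskResult("t3", False, 0.5, 1.0),
    TaskResult("t4", False, 1.0, 0.5),
]
report = composite_report(results, ship_threshold=0.9)
print(report["headline_pass_rate"], report["ship"])  # 0.5 False
print(report["failed_task_ids"])                     # ['t3', 't4']
```

Task `t2` scores poorly on tool-call accuracy and still counts as a success, which is the behavior the headline metric is meant to preserve.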

Production telemetry as the eval substrate

The most important eval system, in the end, is not the offline golden set. It is the production telemetry from real users running real tasks. The golden set is the engineering tool; the telemetry is the truth.

Production telemetry needs three properties to be usable as an eval substrate. First, every execution should be logged with input, output, and intermediate trace (the agent's tool calls and reasoning). Second, every execution should be tagged with the outcome — usually inferred from a downstream signal, like 'did the customer reply with thanks?', 'did the ticket reopen?', 'did the human override the agent's decision?'. Third, the team should be able to slice telemetry by cohort (user type, task type, agent version) and compute the outcome rate per slice.
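The second and third properties can be sketched as follows. The record fields and the rule for inferring success from downstream signals are assumptions; the input, output, and trace fields that the first property requires are omitted here for brevity.

```python
from collections import defaultdict


def infer_outcome(record: dict) -> bool:
    # Hypothetical rule: a run counts as a success if the ticket stayed
    # closed and no human overrode the agent's decision.
    return not record["ticket_reopened"] and not record["human_override"]


def outcome_rate_by(records: list[dict], *keys: str) -> dict[tuple, float]:
    """Group executions by the given cohort keys and compute success rate per group."""
    totals = defaultdict(lambda: [0, 0])  # cohort -> [successes, executions]
    for record in records:
        cohort = tuple(record[k] for k in keys)
        totals[cohort][0] += int(infer_outcome(record))
        totals[cohort][1] += 1
    return {cohort: ok / n for cohort, (ok, n) in totals.items()}


# Made-up telemetry rows for illustration.
records = [
    {"user_type": "new", "task_type": "refund", "agent_version": "v2",
     "ticket_reopened": False, "human_override": False},
    {"user_type": "new", "task_type": "refund", "agent_version": "v2",
     "ticket_reopened": True, "human_override": False},
    {"user_type": "returning", "task_type": "refund", "agent_version": "v2",
     "ticket_reopened": False, "human_override": False},
    {"user_type": "new", "task_type": "billing", "agent_version": "v2",
     "ticket_reopened": False, "human_override": True},
]

rates = outcome_rate_by(records, "user_type", "task_type")
print(rates[("new", "refund")])        # 0.5
print(rates[("returning", "refund")])  # 1.0
```

In production this grouping would more likely be a query against a log store than an in-memory loop, but the shape is the same: tag each execution with an outcome, then aggregate by cohort.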

Teams that have built this pipeline can answer questions like 'what is the success rate for first-time users on task type X over the last week?' in seconds. Teams that have not are flying blind in production. The single highest-leverage investment in agent reliability after the first month of operation is the telemetry-to-evaluation pipeline.

Eval drift and golden-set maintenance

Eval drift is what happens when the agent improves and the golden set does not. The set becomes too easy; the agent passes all of it but still has new failure modes in production. Drift is unavoidable; the maintenance discipline is to add new scenarios to the golden set whenever a new production failure is found. Each production incident should produce at least one new eval scenario that captures the failure, so that the next regression run catches it if it recurs.
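The incident-to-scenario step can be as small as the sketch below. The incident fields and the scenario shape are hypothetical; the expected outcome still has to be written by someone who knows what the correct result was.

```python
def scenario_from_incident(incident: dict, expected_outcome: dict) -> dict:
    """Turn a logged production failure into a golden-set scenario."""
    return {
        "scenario_id": f"incident-{incident['incident_id']}",
        "user_input": incident["user_input"],
        "expected_outcome": expected_outcome,
        "source": "production_incident",
        "tags": ["regression", incident["task_type"]],
    }


def add_scenario(golden_set: list[dict], scenario: dict) -> list[dict]:
    """Return a new golden set including the scenario; adding twice is a no-op."""
    if any(s["scenario_id"] == scenario["scenario_id"] for s in golden_set):
        return golden_set
    return golden_set + [scenario]


# Made-up incident for illustration.
incident = {
    "incident_id": "4812",
    "user_input": "Cancel my order but keep the gift card",
    "task_type": "cancellation",
}
scenario = scenario_from_incident(
    incident,
    expected_outcome={"order_cancelled": True, "gift_card_active": True},
)

golden_set: list[dict] = []
golden_set = add_scenario(golden_set, scenario)
golden_set = add_scenario(golden_set, scenario)  # second add changes nothing
print(len(golden_set), golden_set[0]["scenario_id"])  # 1 incident-4812
```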

The maintenance budget for a golden set is roughly an hour per week per agent in active use. Less than that and the set goes stale; more than that is over-investment. The set should grow slowly and steadily, not in big batches. Old scenarios that no longer reflect production behavior should be retired, but retiring is rarer than adding.

Eval frameworks — build or buy?

Several open-source and commercial eval frameworks exist as of 2026: Inspect (UK AI Security Institute, open source), Promptfoo, Braintrust, OpenAI Evals, Langfuse, Weights & Biases agent tooling, and others. They all do roughly the same thing: provide a harness for running scenarios, calling agents, grading outputs, and storing results.

For a small team with a single agent, an off-the-shelf framework saves engineering time. The cost of building eval infrastructure from scratch is typically two to three engineer-weeks; using an existing framework brings it to a few days. For a larger team with many agents and idiosyncratic needs, building a thin custom layer on top of one of these frameworks is often the right answer — keep the storage and reporting, customize the grading and the trigger logic.

The choice between frameworks matters less than the choice to build eval infrastructure at all. The dominant failure mode in 2026 is not choosing the wrong framework. It is shipping an agent with no evals and discovering production failures by customer complaint.

Strategic read

Eval discipline is the most reliable indicator of whether an agent team knows what it is doing. Vendors who can describe their golden set, their pass rates, their drift management, and their telemetry pipeline are operating a serious engineering process. Vendors whose answer is 'we tested it' are not, regardless of how good the demo looks.

For an operator evaluating agent products, the eval questions to ask are: what is your headline success rate on production traffic, broken down by task type? How is the golden set maintained? What is the gap between offline and online accuracy? What are the failure modes you have caught and how do you detect new ones? A vendor that answers these crisply has built something real. A vendor that does not has built a demo.