Worked examples — three agents, end to end
Three real agent designs, walked through using the framework from chapters 1-6. One for a small business (customer support), one for a corporate function (strategy research), and one for an individual professional (legal contract review). Each shows the scoping decisions, the tool surface, the memory architecture, the eval design, and the failure handling.
By the end of this chapter the reader should be able to apply the framework to their own situation and produce a credible one-page agent spec they could hand to an engineering team.
Example A — SMB customer-support agent
The context: a small business with $5-50M ARR, a customer-support team of three to ten people, and a steady volume of tickets that follow predictable patterns. Most tickets are tier-1: order status, shipping questions, password resets, basic account questions, simple refunds. A meaningful fraction are tier-2 (needs human judgment) and a small fraction are tier-3 (complex, escalated). The agent's job is to resolve tier-1 entirely and to route tier-2 and tier-3 to the right human with the relevant context already attached.
Scope (chapter 2): the agent handles tier-1 fully without human approval, drafts responses for tier-2 that a human reviews and sends, and escalates tier-3 to a human with a context summary. The agent-vs-human boundary is the ticket complexity classification, which the agent makes first and then either acts on or hands off. That classification decision is the most important specification in the design.
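The boundary described above can be sketched as a small routing function. This is a minimal illustration, not the design's actual code: the Tier enum, the route function, and the action names are hypothetical, and the classification step that produces the tier is assumed to happen elsewhere.

```python
from enum import Enum


class Tier(Enum):
    TIER_1 = 1  # predictable requests the agent resolves on its own
    TIER_2 = 2  # needs human judgment
    TIER_3 = 3  # complex, escalated


def route(tier: Tier) -> str:
    """Map an already-classified tier to the agent's next action."""
    if tier is Tier.TIER_1:
        return "resolve"              # agent acts without human approval
    if tier is Tier.TIER_2:
        return "draft_for_review"     # a human reviews and sends the reply
    return "escalate_with_summary"    # tier-3: hand off with context attached
```

Keeping the mapping this explicit makes the boundary easy to review: changing what the agent owns is a one-line change that shows up in a diff.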
Tools (chapter 3): six tools. (1) get_customer_record — pulls the customer's order history, account status, and preferences. (2) get_order_status — fetches current shipping and fulfillment state from the order management system. (3) search_help_docs — semantic search over the company's help-center articles. (4) send_reply — drafts a customer reply (gated by a confirmation step for tier-2 and tier-3). (5) issue_refund — initiates a refund up to a per-ticket cap of $200 without human approval; above that, escalates. (6) escalate_to_human — routes the ticket to the right team queue with a context summary. The tools are scoped tightly; there is no general 'do anything with the database' tool.
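One way to enforce the refund cap is inside the issue_refund tool itself, so that no prompt can talk the agent past it. The sketch below assumes a single refund request per ticket and amounts held in integer cents. The RefundDecision type and the function signature are illustrative assumptions, and the call to the payment system is left out.

```python
from dataclasses import dataclass

REFUND_CAP_CENTS = 200_00  # the $200 per-ticket cap, held in cents


@dataclass
class RefundDecision:
    action: str        # "refund" or "escalate"
    amount_cents: int
    reason: str


def issue_refund(ticket_id: str, amount_cents: int) -> RefundDecision:
    """Approve refunds up to the cap; hand anything larger to a human."""
    if amount_cents <= 0:
        raise ValueError("refund amount must be positive")
    if amount_cents > REFUND_CAP_CENTS:
        return RefundDecision("escalate", amount_cents, "above per-ticket cap")
    # A real tool would call the payment system here; that call is omitted.
    return RefundDecision("refund", amount_cents, "within per-ticket cap")
```

A request for exactly $200 is refunded, and anything above it comes back as an escalation rather than an error, so the agent can pass the decision to escalate_to_human.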
Memory (chapter 4): structured. A small per-customer record in the database stores preferences and prior ticket summaries. The episodic log is the ticket history itself, already in the help desk. No vector store needed at this scale; the help docs are small enough that the search tool can use a simple full-text engine.
Evals (chapter 5): golden set of about 100 tickets covering the common tier-1 cases, the tricky tier-2 cases, and the edge cases that should escalate. Outcome metrics are per-ticket resolution rate and accuracy of the tier classification. Online eval tracks customer-reply sentiment (did the customer reply with thanks or with frustration) and reopen rate (did the ticket reopen within seven days). Pass-rate targets before launch: 85% resolution on tier-1, 95% accuracy on tier classification.
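The launch targets above reduce to two ratios over the golden set. The sketch below assumes each golden ticket has already been run and graded. GoldenResult and launch_gate are hypothetical names, and the grading of resolved_correctly is assumed to come from a human or a separate checker.

```python
from dataclasses import dataclass

TIER1_RESOLUTION_TARGET = 0.85
CLASSIFICATION_TARGET = 0.95


@dataclass
class GoldenResult:
    expected_tier: int        # label from the golden set
    predicted_tier: int       # tier the agent assigned
    resolved_correctly: bool  # graded outcome; used only for tier-1 tickets


def launch_gate(results: list[GoldenResult]) -> dict:
    """Compute both headline metrics and compare them to the launch targets."""
    tier1 = [r for r in results if r.expected_tier == 1]
    if not results or not tier1:
        raise ValueError("golden set needs at least one tier-1 ticket")
    classification_accuracy = sum(
        r.predicted_tier == r.expected_tier for r in results
    ) / len(results)
    tier1_resolution = sum(r.resolved_correctly for r in tier1) / len(tier1)
    return {
        "tier1_resolution": tier1_resolution,
        "classification_accuracy": classification_accuracy,
        "passed": (
            tier1_resolution >= TIER1_RESOLUTION_TARGET
            and classification_accuracy >= CLASSIFICATION_TARGET
        ),
    }
```

Both thresholds must hold at once. A high resolution rate on tickets that were misclassified into tier-1 would not count as a pass.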
Failure handling (chapter 6): the refund tool has a hard cap. The send_reply tool sends tier-1 replies directly but only drafts for tier-2 and tier-3, and those drafts require human approval. The escalate path is preferred over the resolve path when the agent's confidence is below a threshold. Cost cap is $5 per ticket; loops are caught by a hard step limit. PII handling: the agent's tools only return data scoped to the current customer. Time-to-stop: a human can pause the agent in seconds via the help desk admin.
Verification cadence: daily review of escalation rate and refund rate. Weekly review of reopen rate by ticket category. Golden-set re-run monthly and after any prompt or model change.
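The cost cap and the hard step limit can share one small guard that the agent loop calls on every step. This is a sketch under assumptions: RunBudget and BudgetExceeded are hypothetical names, the $5 figure comes from the text, and the default of 20 steps is an invented placeholder because the text gives no number.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a ticket run hits its step limit or cost cap."""


class RunBudget:
    def __init__(self, max_cost_usd: float = 5.00, max_steps: int = 20):
        self.max_cost_usd = max_cost_usd
        self.max_steps = max_steps  # 20 is an assumed value, not from the text
        self.spent_usd = 0.0
        self.steps = 0

    def charge(self, step_cost_usd: float) -> None:
        """Record one agent step; stop the run if either limit is crossed."""
        self.steps += 1
        self.spent_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} exceeded")
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} exceeded")
```

The caller would catch BudgetExceeded and take the escalate path, so a runaway loop ends as a handoff to a human rather than as a silent failure.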
Example B — Fortune-500 strategy-team research agent
The context: a strategy team at a large company is asked to produce briefs on companies, technologies, or markets at the request of executives. A typical brief is 3-5 pages, takes a research analyst one to two days, and pulls from public filings, market data subscriptions, and internal CRM. The volume is dozens of briefs per quarter. The agent's job is to draft the first version of each brief, which a human analyst then edits and finalizes.
Scope (chapter 2): the agent produces a first-draft brief on a named target (company, technology, or market). It does not deliver final briefs to executives; the human analyst is always in the loop for final review and editorial judgment. The goal is to cut analyst time per brief from as much as two days to about four hours, with a proportional increase in the team's capacity.
Tools (chapter 3): twelve tools, organized into three groups. Public data tools: fetch_filings (SEC, foreign equivalents), fetch_press_releases, web_search, fetch_market_data (subscribed sources). Internal tools: query_crm (deal history, account notes), query_internal_research (prior briefs, working files), fetch_calendar (executive priorities). Drafting tools: outline_brief (creates a section outline from a topic), draft_section (produces a section given an outline node and gathered facts), check_citations (verifies that asserted facts have corresponding sources), assemble_brief (compiles sections into the final draft format), summarize_for_executive (produces an executive-summary version).
Memory (chapter 4): hybrid. Structured: a per-topic record stores the last-brief-version, the requesting executive, and the working notes. Episodic: the conversation history with the requesting analyst. Vector retrieval: a focused vector store over prior briefs and internal research, used by query_internal_research to surface relevant prior work. The public-filings corpus is too large and too structured for vector retrieval; it is accessed via the structured fetch tool with date and topic filters.
Evals (chapter 5): golden set of 30 prior briefs with their final versions. The eval is comparison-based: the agent produces a draft from the same prompt that the human worked from, and an LLM-as-judge rates how close the draft is in quality to the human's final version, with human spot-checks of the judge's ratings. Pass criterion: agent draft is judged 'within editing distance' (less than three hours of edit time to bring to final quality) on 80% of the set.
Failure handling (chapter 6): hallucinated citations are the highest-risk failure. The check_citations tool runs before draft delivery and refuses to ship any unverified fact. Prompt-injection risk: the agent reads untrusted external content (press releases, web pages); the tool surface is read-only on external sources, and writes (drafts) are reviewed by humans before any leave the team. Cost cap: $20 per brief. Time-to-stop: any analyst can kill an in-progress draft via the team's internal UI.
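The citation gate can be pictured as a filter that blocks delivery while any claim lacks a verified source. The sketch below is a simplification under stated assumptions: Claim, unverified_claims, and can_ship are hypothetical names, and "verified" here only means the cited source ID is among the documents the agent actually fetched. A fuller check would also confirm that the source text supports the claim.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    source_ids: list[str] = field(default_factory=list)


def unverified_claims(claims: list[Claim], fetched_sources: set[str]) -> list[Claim]:
    """Return claims with no citation, or citing a source that was never fetched."""
    return [
        c for c in claims
        if not c.source_ids or not set(c.source_ids) <= fetched_sources
    ]


def can_ship(claims: list[Claim], fetched_sources: set[str]) -> bool:
    """The draft ships only when every claim is backed by a fetched source."""
    return not unverified_claims(claims, fetched_sources)
```

Returning the offending claims, rather than a bare boolean, gives the analyst a list of exactly what to fix or cut.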
Verification cadence: every brief is reviewed by a human analyst before delivery. Weekly review of analyst-edit time per brief, used to detect quality drift. Monthly golden-set re-run, supplemented by feedback from the executive reader in cases where that executive evaluates the brief.
Example C — Individual lawyer's contract-review agent
The context: a transactional lawyer reviews dozens of contracts a week. Most are variations on familiar templates (commercial agreements, NDAs, employment contracts, SaaS terms). The lawyer has a personal 'playbook' — a set of preferred clauses, redlines they always make, and red-flag clauses they always reject. The agent's job is to do a first-pass review of each contract, mark clauses against the playbook, and produce a redline draft that the lawyer reviews and refines.
Scope (chapter 2): the agent reviews a contract PDF, classifies it by type, identifies the clauses, compares each against the playbook, and produces a redline document. The lawyer reviews the redline and either accepts, modifies, or rejects each suggested change. The agent never sends the redline to the counterparty; that is always the lawyer's action.
Tools (chapter 3): eight tools. parse_pdf (extracts structured text and metadata from a contract PDF), classify_contract (determines contract type from the parsed text), extract_clauses (identifies and labels clauses), match_against_playbook (compares each clause to the lawyer's playbook), generate_redline (produces a marked-up draft showing proposed changes), explain_change (generates a one-sentence rationale for each redline), flag_red_flags (highlights any clauses that match the lawyer's reject list), produce_summary (one-page summary of the contract and proposed changes).
Memory (chapter 4): structured. The playbook itself is a structured document the lawyer maintains. Per-contract record stores the parsed contract and the redline history (for cases where the lawyer revisits the same contract). No vector store needed; the playbook is small enough to fit in the prompt directly.
Evals (chapter 5): golden set of 40 contracts with the lawyer's final redlines. Eval is per-clause precision and recall: did the agent flag the clauses the lawyer would flag, and did it not flag clauses the lawyer would not? Pass criterion: 90% precision (when the agent flags, the lawyer agrees nine times out of ten), 80% recall (the agent catches at least four of every five clauses the lawyer would flag). Precision is weighted higher than recall because false-flag noise is more annoying than a missed clause the lawyer catches on review.
Failure handling (chapter 6): the agent does not auto-send any communication. Every output is staged for the lawyer. Cost cap: $1 per contract. The most consequential failure mode is a missed red-flag clause; the agent's red-flag detection is run twice independently (different prompts) and any disagreement is surfaced to the lawyer. PII handling: contracts often contain sensitive personal and financial data; the agent runs locally or in a trusted-tenant cloud environment, never in a shared SaaS without a specific data-processing agreement.
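Per-clause precision and recall can be computed from two sets of clause IDs: the ones the agent flagged and the ones the lawyer flagged in the final redline. The sketch assumes flags are pooled across the golden set and that clause IDs are comparable between agent and lawyer. The function names, and the convention of scoring an empty set as 1.0, are assumptions.

```python
PRECISION_TARGET = 0.90
RECALL_TARGET = 0.80


def precision_recall(agent_flags: set[str], lawyer_flags: set[str]) -> tuple[float, float]:
    """Precision: share of agent flags the lawyer agreed with.
    Recall: share of lawyer flags the agent also raised."""
    agreed = len(agent_flags & lawyer_flags)
    precision = agreed / len(agent_flags) if agent_flags else 1.0
    recall = agreed / len(lawyer_flags) if lawyer_flags else 1.0
    return precision, recall


def meets_pass_criterion(precision: float, recall: float) -> bool:
    return precision >= PRECISION_TARGET and recall >= RECALL_TARGET
```

For example, if the agent flags four clauses and the lawyer flags four, with three in common, both precision and recall are 0.75 and the run fails the criterion.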
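The double-run check on red flags can be expressed as a comparison of two detector outputs. In this sketch the two detectors are passed in as plain functions, standing in for the two differently prompted model calls. dual_red_flag_check and the Detector alias are hypothetical names.

```python
from typing import Callable

# A detector takes contract text and returns the IDs of clauses it flags.
Detector = Callable[[str], set[str]]


def dual_red_flag_check(
    contract_text: str, detector_a: Detector, detector_b: Detector
) -> dict[str, set[str]]:
    """Run both detectors and separate agreed flags from disputed ones."""
    flags_a = detector_a(contract_text)
    flags_b = detector_b(contract_text)
    return {
        "agreed": flags_a & flags_b,    # both runs flagged these clauses
        "disputed": flags_a ^ flags_b,  # only one run flagged these
    }
```

Both groups would go to the lawyer. The disputed group is marked separately, because a clause one run caught and the other missed is exactly the case where a single pass might have stayed silent.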
Verification cadence: every contract is reviewed by the lawyer. Weekly review of lawyer-override rate by playbook clause; clauses the lawyer overrides repeatedly are signals to update the playbook (and the agent's behavior follows automatically).
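The weekly override review comes down to a per-clause ratio. A minimal sketch, assuming each reviewed suggestion is logged as a pair of playbook clause ID and whether the lawyer overrode it. The log format and the override_rates name are assumptions.

```python
from collections import Counter


def override_rates(reviews: list[tuple[str, bool]]) -> dict[str, float]:
    """reviews holds (playbook_clause_id, lawyer_overrode) pairs for the week."""
    totals: Counter[str] = Counter()
    overrides: Counter[str] = Counter()
    for clause_id, overrode in reviews:
        totals[clause_id] += 1
        overrides[clause_id] += int(overrode)
    return {cid: overrides[cid] / totals[cid] for cid in totals}
```

Sorting the result by rate puts the playbook clauses most in need of an update at the top of the review.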
The pattern across the three
All three agents are scoped narrowly relative to the full job they could conceivably do. The customer-support agent does not negotiate refunds above a cap. The research agent does not deliver to executives directly. The contract-review agent does not send communications to the counterparty. Each agent has a clearly drawn agent-vs-human boundary, and that boundary is the central design decision.
All three tool surfaces have fewer than fifteen tools, and the tools are at the level a human in the same role would think about (not at the level the underlying API exposes). All three memory architectures default to structured storage with selective use of more complex patterns only where the job demands. All three eval systems run against a golden set with explicit outcome metrics and clearly stated pass criteria. All three failure-handling stories include hard caps, confirmation gates for irreversible actions, and a designed time-to-stop.
The recurring engineering reality: building these agents takes weeks, not days. Most of the time goes into tools, evals, and failure handling, not into the model prompt. The prompt is a few hundred words and rarely the bottleneck on quality. The tools, the eval set, and the production telemetry are the bulk of the engineering investment.
A one-page spec template the reader can use
Use the seven-section structure below to produce a credible one-page spec for any new agent. The act of writing the spec usually surfaces the decisions the team has not yet made, which is half the value.
- Job: the specific role / function / task the agent replaces or augments. One paragraph.
- Scope: the agent-vs-human boundary. What the agent owns end-to-end; what it hands back to a human; what it never does. One paragraph.
- Tools: the named list of tools (5-15), with one-line descriptions and the rationale for each.
- Memory: structured, episodic, semantic; what each holds; the retrieval pattern.
- Evals: the golden set size, the headline metric, the pass criterion for launch, the online-eval plan.
- Failure handling: the per-execution cost cap, the loop limit, the irreversible-action gates, the time-to-stop, the worst-day plan.
- Verification cadence: who reviews what, how often, and what telemetry feeds the review.
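The seven sections above can also be held as a small data structure, which makes the unmade decisions easy to list. This is one possible encoding, not a prescribed format: AgentSpec and open_decisions are hypothetical names, and each section is kept as free text.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentSpec:
    job: str = ""
    scope: str = ""
    tools: str = ""
    memory: str = ""
    evals: str = ""
    failure_handling: str = ""
    verification_cadence: str = ""

    def open_decisions(self) -> list[str]:
        """Names of sections still blank, i.e. decisions not yet made."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]
```

A team could fill in the fields during a spec review and treat a non-empty open_decisions() result as the agenda for the next meeting.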
Strategic read — closing the framework
The framework across chapters 1-7 is the operating discipline of building agents in 2026. It is not the only way, and the field will evolve. But the framework's core claim — that the agent-vs-human boundary, the tool surface, the memory architecture, the eval design, and the failure handling are all first-order engineering decisions that together determine whether the agent works in production — has held across the diversity of agent products that have shipped over the last twelve months.
For a reader who has worked through the seven chapters, the next move depends on the role. An engineer should pick a small concrete agent and build it end-to-end, going back to the chapters when the relevant question comes up. A product manager should write a one-page spec for an agent their team is considering and use the spec to surface the decisions the team has not yet made. An investor or operator should use the framework as the lens through which to evaluate agent products from vendors: ask the seven questions, listen for the answers, and weight the vendor's quality accordingly.
The deeper claim of the framework is that agents are an engineering discipline, not a model-prompting discipline. Teams that internalize this build agents that work. Teams that do not internalize it build agents that only demo.