Chatbots, coding terminals, cloud agents, knowledge work — the apps that run on top of frontier models.
Past the chatbot demos and the job-loss headlines, AI is already changing the daily life of an elderly parent, a teenager learning a language, a small-business owner, and a software engineer. The honest accounting of who is getting what, today.
Jensen at Stanford CS153: design real evals, because otherwise teams optimise the number rather than the capability. Agent task completion rate beats MMLU.
How often the model solves a task in k attempts. Pass@1 measures reliability; pass@k measures whether the capability exists at all.
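As a rough illustration of the pass@k idea above, the sketch below uses the unbiased estimator popularised by the HumanEval benchmark: sample n attempts per task, count the c that pass, and estimate the chance that at least one of k randomly drawn attempts succeeds. The blurb does not say which estimator it has in mind, so treat the choice, along with the function name `pass_at_k`, as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k attempts drawn without replacement from the n samples all fail.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so any draw of k includes a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 attempts on one task, 3 of them passed.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

With k = 1 the estimate reduces to the plain success fraction c / n, which is why pass@1 reads as a reliability number, while a large k mostly reports whether any attempt can succeed at all.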
MMLU has saturated. GPQA is approaching saturation. HLE and ARC-AGI are the current frontier evals. Each cohort of benchmarks lives roughly 18 months before it saturates.
How many tokens the model consumes to complete a typical task. Predicts inference cost. Reasoning models consume 10-100x more tokens than conventional ones at the same accuracy.
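To make the link between tokens per task and inference cost concrete, here is a minimal arithmetic sketch. The `Pricing` class, the per-million-token prices, and the token counts are all hypothetical placeholders, not figures from any real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def task_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Dollar cost of one task given its token counts and a price sheet."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    prices = Pricing(input_per_million=1.0, output_per_million=4.0)  # assumed
    # Same prompt, but the second run emits 100x more output tokens,
    # standing in for a model that "thinks" before answering.
    print(f"conventional: ${task_cost(2_000, 500, prices):.4f}")
    print(f"reasoning:    ${task_cost(2_000, 50_000, prices):.4f}")
```

Because output tokens usually dominate the bill, a multiple in tokens per task shows up almost directly as a multiple in cost per task.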
End-to-end latency to complete a real workflow. Includes prefill, decode, tool calls, retries. The metric a human user actually feels.
For 60 years, software was prerecorded: write code, compile, ship, run.
Every token a model generates is either consumed by the model itself (thought) or sent to the world (action).
A language model works because it learns a representation of words, characters, and syntax.
AI is not just a futuristic promise or a simple corporate headcount trim.
AI replaces work where the task has enough digital context, cheap verification, clear handoff, and economic value to justify the failure handling.
Agents stick when they own a repeated job with context, tools, permissions, feedback, and a visible business outcome.
The honest answer is: probably not your whole job, but almost certainly some of your tasks.
The moat is rarely the model call. It is workflow ownership, proprietary context, distribution, trust, feedback data, and the cost of switching the operating loop.
Care about the apps that own painful workflows, improve with use, route models intelligently, and move from experiment budget to operating budget.
Most things called 'agents' in 2026 are chatbots with extra steps.
The hardest part of building a useful agent is picking the right job for it.
An agent without tools is a chatbot. An agent with too many tools is paralyzed. Tool design is the most underrated engineering decision in agent building: each tool is a small API the agent has to understand, call correctly, handle errors from, and combine with others. The principles below are what separate agents that work in production from agents that work only in demos.
An agent that forgets what it did yesterday is severely limited.
A demo proves the agent can succeed once. An eval proves the agent succeeds reliably, at known rates, on known kinds of inputs. Most agent projects fail because the team did not invest in evals early enough. Evals are the engineering discipline that separates an agent that works once from an agent that works reliably.
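One way to read "known rates, on known kinds of inputs" is a per-category success rate with an uncertainty band. The sketch below assumes eval results arrive as `(category, passed)` pairs and uses a Wilson score interval at roughly 95% confidence; the category names and the `summarize` helper are illustrative, not taken from the chapter.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize(results):
    """Map each input category to (success rate, lower bound, upper bound)."""
    counts = defaultdict(lambda: [0, 0])  # category -> [successes, trials]
    for category, passed in results:
        counts[category][1] += 1
        if passed:
            counts[category][0] += 1
    return {cat: (s / n, *wilson_interval(s, n)) for cat, (s, n) in counts.items()}


if __name__ == "__main__":
    # Hypothetical eval run: 10 refund tasks (8 passed), 3 billing tasks (all passed).
    runs = [("refund", True)] * 8 + [("refund", False)] * 2 + [("billing", True)] * 3
    for cat, (rate, lo, hi) in summarize(runs).items():
        print(f"{cat}: {rate:.0%} (interval {lo:.0%} to {hi:.0%})")
```

The wide interval on the three-task category is the point: a perfect score on a handful of inputs is closer to a demo than to an eval.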
Every agent in production will fail. The question is not whether but what shape the failure takes, who notices, and how the system recovers. The taxonomy below covers the failures that matter; ignore them and your agent will discover them for you in front of a customer.
Three real agent designs, walked through using the framework from chapters 1-6.
If models keep getting cheaper to deploy, the rate-limiting step for the AI-built economy is not building the product — it is getting paid for it.
Stripe is one of the only entities with a near-real-time view of how many AI-built businesses are forming, where they cluster, and how fast they monetize.
The shift from human buyers to agent buyers is not a UX change.
When a buyer is a human, sellers can learn the buyer's customer lifetime value over months of behavior.
Stripe is not the only candidate to win agentic payments.