Peregrinations - Models: Frontier Labs, Launches & Architecture

State of the Art

Benchmarks tracked

Labs in the race

Lead changes

Most SOTA models

Anthropic · 10

Today

State of the Art Distribution

Anthropic 10/20

OpenAI 9/20

Google 1/20
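The distribution above is a tally of the leaderboard that follows: each benchmark counts once for the lab whose model currently holds it. A minimal sketch of that arithmetic, assuming the holder of each benchmark is copied from this page (the names `HOLDERS` and `sota_distribution` are illustrative, not part of the site):

```python
from collections import Counter

# Lab currently holding each benchmark, copied from the leaderboard on this page.
HOLDERS = {
    "SWE-bench Pro": "Anthropic",
    "SWE-bench Verified": "Anthropic",
    "Terminal-Bench 2.0": "OpenAI",
    "Humanity's Last Exam": "Anthropic",
    "BrowseComp": "OpenAI",
    "MCP-Atlas": "Anthropic",
    "OSWorld-Verified": "OpenAI",
    "Finance Agent v1.1": "Anthropic",
    "CyberGym": "Anthropic",
    "GPQA Diamond": "Google",
    "CharXiv Reasoning": "Anthropic",
    "MMMLU": "Anthropic",
    "GDPVal": "OpenAI",
    "Toolathalon": "OpenAI",
    "FrontierMath (Tier 1-3)": "OpenAI",
    "FrontierMath (Tier 4)": "OpenAI",
    "TAU-bench Retail": "Anthropic",
    "TAU-bench Airline": "Anthropic",
    "MMMU (Validation)": "OpenAI",
    "AIME 2025": "OpenAI",
}


def sota_distribution(holders):
    """Return {lab: "held/total"}, ordered by benchmarks held (descending)."""
    counts = Counter(holders.values())
    total = len(holders)
    return {lab: f"{held}/{total}" for lab, held in counts.most_common()}


if __name__ == "__main__":
    for lab, share in sota_distribution(HOLDERS).items():
        print(lab, share)
```

Counting by benchmark rather than by model means a single model that leads several benchmarks contributes several times to its lab's share.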

Leaderboard

Who holds each benchmark

| Category | Benchmark | Lab | Model | Score | Leading for | Status |
|---|---|---|---|---|---|---|
| Agentic Coding | SWE-bench Pro | Anthropic | Claude Mythos Preview | 77.8% | 52d | preview |
| Agentic Coding | SWE-bench Verified | Anthropic | Claude Mythos Preview | 93.9% | 52d | preview |
| Terminal Coding | Terminal-Bench 2.0 | OpenAI | GPT-5.5 Thinking | 82.7% | 49d | |
| Multidisciplinary Reasoning | Humanity's Last Exam | Anthropic | Claude Mythos Preview | 64.7% | 52d | preview |
| Agentic Search | BrowseComp | OpenAI | GPT-5.5 Pro | 90.1% | 49d | |
| Scaled Tool Use | MCP-Atlas | Anthropic | Claude Opus 4.7 | 77.3% | 52d | |
| Computer Use | OSWorld-Verified | OpenAI | GPT-5.5 Thinking | 78.7% | 49d | |
| Financial Analysis | Finance Agent v1.1 | Anthropic | Claude Opus 4.7 | 81.3% | 57d | |
| Security Research | CyberGym | Anthropic | Claude Mythos Preview | 83.1% | 52d | preview |
| Graduate Reasoning | GPQA Diamond | Google | Gemini 3.1 Pro | 94.1% | 102d | |
| Visual Reasoning | CharXiv Reasoning | Anthropic | Claude Mythos Preview | 93.2% | 52d | preview |
| Multilingual Q&A | MMMLU | Anthropic | Claude Mythos Preview | 92.7% | 52d | preview |
| Knowledge Work | GDPVal | OpenAI | GPT-5.5 Thinking | 84.9% | 49d | |
| Agentic Tool Use | Toolathalon | OpenAI | GPT-5.5 Thinking | 55.6% | 49d | |
| Advanced Math | FrontierMath (Tier 1-3) | OpenAI | GPT-5.5 Pro | 52.4% | 49d | |
| Advanced Math | FrontierMath (Tier 4) | OpenAI | GPT-5.5 Pro | 39.6% | 49d | |
| Agentic Tool Use | TAU-bench Retail | Anthropic | Claude Opus 4.1 | 82.4% | 311d | |
| Agentic Tool Use | TAU-bench Airline | Anthropic | Claude Sonnet 4 | 60.0% | 386d | |
| Visual Reasoning | MMMU (Validation) | OpenAI | OpenAI o3 | 82.9% | 179d | |
| Advanced Math | AIME 2025 | OpenAI | OpenAI o3 | 88.9% | 179d | |

The race

How the lead changed hands

2022–2026


| Category | Benchmark | Current leader | Score | Status | Claims |
|---|---|---|---|---|---|
| Agentic Coding | SWE-bench Pro | Claude Mythos Preview | 77.8% | preview | 4 |
| Agentic Coding | SWE-bench Verified | Claude Mythos Preview | 93.9% | preview | 12 |
| Terminal Coding | Terminal-Bench 2.0 | GPT-5.5 Thinking | 82.7% | | 7 |
| Multidisciplinary Reasoning | Humanity's Last Exam | Claude Mythos Preview | 64.7% | preview | 4 |
| Agentic Search | BrowseComp | GPT-5.5 Pro | 90.1% | | 4 |
| Scaled Tool Use | MCP-Atlas | Claude Opus 4.7 | 77.3% | | 3 |
| Computer Use | OSWorld-Verified | GPT-5.5 Thinking | 78.7% | | 4 |
| Financial Analysis | Finance Agent v1.1 | Claude Opus 4.7 | 81.3% | | 1 |
| Security Research | CyberGym | Claude Mythos Preview | 83.1% | preview | 2 |
| Graduate Reasoning | GPQA Diamond | Gemini 3.1 Pro | 94.1% | | 8 |
| Visual Reasoning | CharXiv Reasoning | Claude Mythos Preview | 93.2% | preview | 5 |
| Multilingual Q&A | MMMLU | Claude Mythos Preview | 92.7% | preview | 2 |
| Knowledge Work | GDPVal | GPT-5.5 Thinking | 84.9% | | 1 |
| Agentic Tool Use | Toolathalon | GPT-5.5 Thinking | 55.6% | | 1 |
| Advanced Math | FrontierMath (Tier 1-3) | GPT-5.5 Pro | 52.4% | | 1 |
| Advanced Math | FrontierMath (Tier 4) | GPT-5.5 Pro | 39.6% | | 1 |
| Agentic Tool Use | TAU-bench Retail | Claude Opus 4.1 | 82.4% | | 2 |
| Agentic Tool Use | TAU-bench Airline | Claude Sonnet 4 | 60.0% | | 1 |
| Visual Reasoning | MMMU (Validation) | OpenAI o3 | 82.9% | | 2 |
| Advanced Math | AIME 2025 | OpenAI o3 | 88.9% | | 2 |


Each strip is one benchmark; a colored segment shows the lab that held the lead during that window. Narrow segments are short reigns, where the crown moved fast.
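One way to picture how a strip could be built from a benchmark's claims is sketched below. It assumes a "claim" is a dated new top score, and that back-to-back claims by the same lab merge into one colored segment. The `Claim` and `Segment` types, the `lead_segments` function, and the "Lab A" / "Lab B" data are all hypothetical and are not taken from this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Claim:
    day: date      # date the new top score was posted
    lab: str
    model: str
    score: float


@dataclass(frozen=True)
class Segment:
    lab: str
    start: date
    end: date

    @property
    def days(self) -> int:
        return (self.end - self.start).days


def lead_segments(claims, today):
    """Turn one benchmark's claims into contiguous per-lab lead windows."""
    ordered = sorted(claims, key=lambda c: c.day)
    segments = []
    for i, claim in enumerate(ordered):
        # A reign lasts until the next claim, or until today for the current leader.
        end = ordered[i + 1].day if i + 1 < len(ordered) else today
        if segments and segments[-1].lab == claim.lab:
            # Same lab beat its own score: widen the current segment.
            segments[-1] = Segment(claim.lab, segments[-1].start, end)
        else:
            segments.append(Segment(claim.lab, claim.day, end))
    return segments


# Hypothetical claims for a single benchmark (illustrative data only).
EXAMPLE_CLAIMS = [
    Claim(date(2025, 1, 1), "Lab A", "a-1", 50.0),
    Claim(date(2025, 3, 1), "Lab B", "b-1", 55.0),
    Claim(date(2025, 3, 15), "Lab B", "b-2", 60.0),
    Claim(date(2025, 6, 1), "Lab A", "a-2", 62.0),
]

if __name__ == "__main__":
    for seg in lead_segments(EXAMPLE_CLAIMS, today=date(2025, 7, 1)):
        print(seg.lab, seg.start, seg.end, f"{seg.days}d")
```

Under these assumptions the four example claims collapse into three segments of 59, 92, and 30 days, and the last segment's length corresponds to a "leading Nd" figure for the current holder.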

The Frontier

Pace of Innovation

Cumulative flagship releases over time, illustrating the accelerating pace of model launches and capabilities.