How the lead changed hands
Agentic Coding
SWE-bench Pro
Leader: Claude Mythos Preview 77.8% (preview)·4 claims
Agentic Coding
SWE-bench Verified
Leader: Claude Mythos Preview 93.9% (preview)·12 claims
Terminal Coding
Terminal-Bench 2.0
Leader: GPT-5.5 Thinking 82.7%·7 claims
Multidisciplinary Reasoning
Humanity's Last Exam
Leader: Claude Mythos Preview 64.7% (preview)·4 claims
Leader: GPT-5.5 Pro 90.1%·4 claims
Leader: Claude Opus 4.7 77.3%·3 claims
Computer Use
OSWorld-Verified
Leader: GPT-5.5 Thinking 78.7%·4 claims
Financial Analysis
Finance Agent v1.1
Leader: Claude Opus 4.7 81.3%·1 claim
Security Research
CyberGym
Leader: Claude Mythos Preview 83.1% (preview)·2 claims
Graduate Reasoning
GPQA Diamond
Leader: Gemini 3.1 Pro 94.1%·8 claims
Visual Reasoning
CharXiv Reasoning
Leader: Claude Mythos Preview 93.2% (preview)·5 claims
Leader: Claude Mythos Preview 92.7% (preview)·2 claims
Leader: GPT-5.5 Thinking 84.9%·1 claim
Agentic Tool Use
Toolathalon
Leader: GPT-5.5 Thinking 55.6%·1 claim
Advanced Math
FrontierMath (Tier 1-3)
Leader: GPT-5.5 Pro 52.4%·1 claim
Advanced Math
FrontierMath (Tier 4)
Leader: GPT-5.5 Pro 39.6%·1 claim
Agentic Tool Use
TAU-bench Retail
Leader: Claude Opus 4.1 82.4%·2 claims
Agentic Tool Use
TAU-bench Airline
Leader: Claude Sonnet 4 60.0%·1 claim
Visual Reasoning
MMMU (Validation)
Leader: OpenAI o3 82.9%·2 claims
Leader: OpenAI o3 88.9%·2 claims
Each strip is one benchmark; a colored segment shows the lab that held the lead during that window. Narrow segments are short reigns, where the lead changed hands quickly.
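As a rough illustration of how such a strip could be derived, the sketch below turns a list of dated score claims for one benchmark into lead segments. It is not the page's actual method. The `Claim` and `Segment` types, the `lead_segments` function, and the rule that only a strictly higher score takes the lead are all assumptions made for this example, as is the reading of a "claim" as one dated score reported for a model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Claim:
    """One reported score (hypothetical shape, not taken from the page)."""
    day: date      # date the score was reported
    lab: str       # lab that released the model
    model: str
    score: float   # benchmark score in percent


@dataclass
class Segment:
    """One reign: the window during which a model held the top score."""
    lab: str
    model: str
    score: float
    start: date
    end: Optional[date]  # None means the lead is still held


def lead_segments(claims: List[Claim]) -> List[Segment]:
    """Return one segment per reign for a single benchmark, oldest first.

    Assumption: a claim takes the lead only if its score is strictly
    higher than the current best; ties leave the lead unchanged.
    """
    segments: List[Segment] = []
    for claim in sorted(claims, key=lambda c: c.day):
        current = segments[-1] if segments else None
        if current is not None and claim.score <= current.score:
            continue  # not a new record, so the lead does not change
        if current is not None:
            current.end = claim.day  # the previous reign ends here
        segments.append(
            Segment(claim.lab, claim.model, claim.score, claim.day, None)
        )
    return segments
```

Under these assumptions, the width of a segment is simply `end - start`, so a narrow segment corresponds to a record that was beaten soon after it was set. A new record from the same lab starts a new segment here; merging consecutive segments by lab would be an equally plausible design.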