Claude Mythos Preview Doubles the Long-Horizon Bar on METR's Agentic Benchmark

Anthropic's Mythos Preview scores 2x the next best model on METR's 80% success-rate time-horizon test, reshaping the agentic capability competitive landscape.

6. Claude Mythos Preview Doubles the Long-Horizon Bar on METR's Agentic Benchmark

Anthropic shared an early snapshot of Claude Mythos Preview with METR, the independent AI evaluation organization, for testing on their long-horizon task benchmark. The result: Mythos Preview achieves a time horizon more than 2x that of the next best model at the 80% success-rate threshold. METR's benchmark measures how far into a multi-step task a model can operate before reliability collapses, making it one of the few public evaluations designed specifically for agentic, not chat, performance. The disclosure came via Alex Albert, Anthropic's head of developer relations, on May 7, 2026.

A 2x gap at the 80% success-rate line is not a marginal improvement. It means Mythos Preview can sustain reliable autonomous execution across task chains that are roughly twice as long as what OpenAI's best current models or Google's Gemini line can manage before failure rates climb past acceptable thresholds. For teams building production agents, that threshold matters more than peak benchmark scores: a model that degrades at step 12 instead of step 6 is not twice as good, it is qualitatively different for real deployments. Anthropic has been positioning Claude as the agent-first frontier model since the Claude 3 series, and this result gives that claim a concrete, third-party number to stand behind.

The metric METR uses here will likely become a standard competitive reference point. OpenAI's operator-facing roadmap for GPT-5 and its successors has emphasized agentic reliability, and Google DeepMind's Gemini 2.5 Pro has made similar positioning moves. Watch for both to publish or commission comparable long-horizon evaluations in the next 60 days. The real question is whether Mythos Preview's lead holds at the full release, or whether the "early snapshot" framing signals that the gap narrows before launch.

Source: @alexalbert__ on X