ARC-AGI-3 Arrives and Every Frontier Model Scores Below 1%
A new benchmark resets the capability ceiling just as labs claimed reasoning benchmarks were nearly saturated.
François Chollet announced ARC-AGI-3 on April 30, 2026, and the opening numbers are unambiguous: every current frontier model scores below 1% on the new benchmark. Its predecessor, ARC-AGI-2, was released in early 2025, and it took roughly twelve months for top models to crack double digits. ARC-AGI-3 resets that clock entirely. Chollet's post frames the sub-1% result as a starting point, not a verdict, and asks where scores will land by year-end.
The timing cuts against a narrative that has been building across the industry. OpenAI, Google DeepMind, and Anthropic have each pointed to near-saturated scores on benchmarks like MATH, GPQA, and HumanEval as evidence that frontier reasoning is maturing fast. ARC-AGI-3 makes that framing look premature. A benchmark where GPT-4o, Gemini 2.5 Pro, and Claude 3.7 Sonnet all cluster below 1% is not a marginal recalibration. It exposes a gap between benchmark-optimized reasoning and the kind of flexible, novel problem-solving ARC-AGI is designed to measure. For labs that have staked product positioning on reasoning breakthroughs, a public sub-1% score is an uncomfortable data point to explain.
The pattern here is familiar. ARC-AGI-1 held for years before scaling cracked it. ARC-AGI-2 held longer than most expected. Each iteration has raised the bar for what counts as general reasoning, and each time the benchmark has stayed ahead of the models longer than the labs predicted. The number to watch is not the current score but the rate of improvement through Q3 2026. If scores are still below 5% in September, the benchmark will have done its job: keeping the industry honest about what "reasoning" actually means.
Source: @fchollet on X