Claude Opus 4.7 Outpaces Its Own Human-AI Hybrid Baseline by 20x in Robotics Code
Anthropic's Project Fetch Phase 2 shows a single model lapping a human-plus-AI team, signaling a threshold shift in autonomous coding capability.
Anthropic's Frontier Red Team published Phase 2 results from Project Fetch on June 16, 2026, a benchmark where Claude programs a physical robodog to retrieve a beach ball. Claude Opus 4.7, running without human assistance, completed the programming task approximately 20 times faster than last year's best human team working alongside Claude Opus 4.1. The robodog still failed to actually fetch the ball, but the speed gap between the two configurations is the finding that matters.
The comparison is deliberately structured to expose a specific shift: the human-plus-model pairing from 2025 is now the slower configuration. That reframes the competitive question for every lab racing to ship agentic coding tools. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro have both posted strong software engineering benchmarks in recent months, but those tests measure correctness, not autonomous throughput on physical hardware control tasks. Anthropic is staking out a different axis: speed of unassisted iteration in a domain where the feedback loop involves real-world failure states. A 20x gap on that axis, if it holds across task types, changes the calculation of when human oversight adds value and when it only slows output.
The task still failed at execution. That detail is worth tracking closely. Faster code generation that produces incorrect robot behavior is a capability profile with obvious safety implications, and Anthropic's own framing of this as a "red team" exercise signals that the company is treating the speed gain as something to stress-test, not just celebrate. The next phase to watch is whether Opus 4.7's throughput advantage persists when correctness constraints tighten, and whether competing labs publish comparable physical-world benchmarks or continue to compete primarily on text-based coding evals.
Source: @AnthropicAI on X