Frontier Research Agents Pass at Under 22% on Consulting-Grade Work
A new benchmark with verifiable rubrics and cognitive traps reveals frontier deep research agents fail decision-grade consulting tasks at alarming rates.
Enterprise teams are deploying deep research agents on decision-grade work before anyone has measured whether those agents can actually do it. Existing benchmarks test factual recall and single-hop QA. The work being outsourced to these systems, structured analytical deliverables, competitive landscapes, due diligence memos, scenario analyses, is something else entirely.
The benchmark introduced here targets exactly that gap. Forty-two prompts authored by subject matter experts cover the kind of structured output that fills a management consultant's week. Each prompt generates a response from three frontier agents: Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research. Scoring runs on two layers simultaneously. A set of deterministic ground-truth verifiers, averaging 13.8 per task, checks whether specific facts, figures, and structural requirements are present. A five-criterion rubric scored 0-3 by SMEs evaluates reasoning quality, completeness, and analytical depth. These two layers compose into a single Verifier-Rubric Score (VRS) on a 0-100 scale. The key design choice is conjunctive: a response only "accepts" if it clears both a rubric mean of 2.5 and a verifier pass rate of 80%. Clearing one while failing the other counts as a failure. Most prompts also embed cognitive traps, surface-pattern shortcuts that produce plausible-looking but wrong answers, specifically to penalize agents that pattern-match rather than reason.
Acceptance rates under this joint threshold are uniformly low. Gemini reaches 21.4%. o3 and Claude each land at 9.5%. Mean VRS scores align with comparable rubric-based benchmarks, the top score here is 62.6 against APEX-v1's 64.2 and ProfBench's 65.9, which validates the rubric construct. The acceptance floor sits three points below APEX-Agents' dedicated DR agent band of 12.3-22.7%, a gap the paper attributes directly to stricter conjunctive grading and trap design. For teams evaluating which agent to deploy on analytical workflows, the takeaway is direct: overall VRS scores mask failure modes that only conjunctive grading exposes, and each agent fails in a structurally different way.
The failure signatures are worth naming precisely. Claude produces the required deliverable format most reliably, at 4.5 times the rate of the other two agents on file-required tasks, but carries the highest fabrication signature. o3 shows the cleanest average reasoning scores yet drops required sections and propagates arithmetic errors through calculations. Gemini is bimodal: it posts the highest acceptance rate overall while also accumulating the most zero-scored rubric cells, meaning it either performs well or collapses entirely with little middle ground.
We're thinking: We find the fabrication-versus-format tradeoff in Claude's profile particularly telling for enterprise deployment decisions. A system that reliably produces a well-structured deliverable containing confident fabrications is arguably more dangerous in a consulting context than one that produces incomplete work, because the former is harder for a non-expert reviewer to catch. The conjunctive grading design here is not pedantry; it is the right model for how consulting outputs actually get used. A memo that passes the smell test but fails on verifiable specifics does not become a good memo because it looks like one. Teams treating high rubric scores as sufficient signal for deployment readiness should treat this benchmark as a corrective.
Key takeaways:
- Conjunctive grading combining deterministic verifiers with SME rubrics exposes failure modes that either layer alone conceals, and cognitive traps built into prompts specifically penalize surface-pattern matching rather than genuine synthesis.
- Acceptance rates across three frontier agents top out at 21.4% and drop to 9.5%, with mean VRS scores in the 55-63 range; the benchmark aligns with APEX and ProfBench rubric scores but acceptance floors run lower due to stricter joint thresholds.
- Teams deploying deep research agents on analytical deliverables should audit against both format compliance and verifiable factual accuracy separately before trusting aggregate quality scores.
Source: Evaluating Deep Research Agents on Expert Consulting Work