Deep Research Agents Fail on the Basics — and Current Benchmarks Can't See It

Research agents that write full reports are being tested on the live web, where results change daily and runs cannot be repeated. A new evaluation framework replaces the live web with a frozen, realistic document collection for each task, making scores comparable across runs and exposing failure modes, such as poor citations or missing facts, that a single score hides.

Deep Research Agents (DRAs), systems that plan, retrieve, synthesize, and generate full research reports, are being deployed in production while their evaluations run on live web environments that change daily and cannot be reproduced. Two teams running the same agent on the same task a week apart can get different numbers. That is a broken measurement framework, not ordinary evaluation noise.

DR³-Eval (Deep Research Realistic and Reproducible Evaluation) fixes the environment problem by constructing per-task static sandbox corpora built from authentic user-provided research materials. Each sandbox simulates open-web complexity: supportive documents, distractor documents, and noise are deliberately mixed together, mirroring what a real retrieval pipeline encounters. Task definitions are grounded in actual user requests rather than synthetic prompts designed to be solvable. The corpus stays frozen, so any two evaluators running any agent get the same retrieval surface.
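
To make the "same retrieval surface" idea concrete, here is a minimal sketch of a frozen per-task sandbox. It is not DR³-Eval's implementation: the SandboxDoc type, the corpus_fingerprint and search helpers, and the sample documents are all hypothetical, and the keyword-overlap retrieval is a deliberately simple stand-in for whatever retriever a real pipeline uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxDoc:
    """One document in a hypothetical per-task sandbox corpus."""
    doc_id: str
    role: str  # "supportive", "distractor", or "noise"
    text: str


def corpus_fingerprint(docs):
    """Order-independent SHA-256 digest over ids, roles, and text.

    Two evaluators can compare digests to confirm they hold the same corpus.
    """
    h = hashlib.sha256()
    for d in sorted(docs, key=lambda d: d.doc_id):
        for field in (d.doc_id, d.role, d.text):
            data = field.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguity
            h.update(data)
    return h.hexdigest()


def search(docs, query, k=3):
    """Toy keyword-overlap retrieval with deterministic tie-breaking by doc_id."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(d.text.lower().split())), d.doc_id) for d in docs]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked[:k] if score > 0]


# Invented sample corpus mixing the three document roles.
docs = [
    SandboxDoc("d1", "supportive", "solar capacity grew in 2023 according to the agency report"),
    SandboxDoc("d2", "distractor", "solar capacity forecast from an outdated 2015 blog post"),
    SandboxDoc("d3", "noise", "recipe for lemon cake"),
]
print(corpus_fingerprint(docs))
print(search(docs, "solar capacity 2023"))
```

Because the corpus never changes and ties are broken by document ID, the same query returns the same ranked list on every run, and the distractor still competes with the supportive document for the agent's attention.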

The evaluation framework scores reports along five distinct axes: Information Recall (whether the agent surfaced the relevant facts), Factual Accuracy (whether its claims are correct), Citation Coverage (whether claims are attributed to sources), Instruction Following (whether the report meets the task's requirements), and Depth Quality (how substantive the analysis is), all on multi-file, multimodal report generation tasks. These dimensions decouple failure modes that a single aggregate score would hide. An agent can score well on recall while fabricating citations, or follow formatting instructions precisely while missing key evidence entirely. The benchmark makes those failure patterns visible and reproducible.
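
The decoupling can be illustrated with a small sketch. The formulas below are simplified assumptions for illustration, not the benchmark's actual scoring: the Claim type, the score_report function, and the idea that each claim arrives pre-labeled by a judge are all hypothetical. Only three of the axes are modeled.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One claim extracted from a generated report (hypothetical structure)."""
    text: str
    is_correct: bool                      # assumed verdict from a human or model judge
    cited_doc_ids: tuple = ()             # sources the report attaches to this claim
    covers_facts: frozenset = frozenset() # reference facts this claim expresses


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def score_report(claims, reference_facts, corpus_doc_ids):
    """Return three separate scores instead of one aggregate."""
    covered = set().union(*(c.covers_facts for c in claims)) & set(reference_facts)
    correct = sum(1 for c in claims if c.is_correct)
    # A claim counts as cited only if at least one citation resolves to a corpus document.
    grounded = sum(1 for c in claims if any(d in corpus_doc_ids for d in c.cited_doc_ids))
    return {
        "information_recall": _ratio(len(covered), len(reference_facts)),
        "factual_accuracy": _ratio(correct, len(claims)),
        "citation_coverage": _ratio(grounded, len(claims)),
    }


# Invented example: every reference fact is recalled, but one citation
# points at a document that is not in the corpus and one claim has no citation.
claims = [
    Claim("A", True, ("d1",), frozenset({"f1"})),
    Claim("B", True, ("d9",), frozenset({"f2"})),
    Claim("C", False, (), frozenset()),
]
print(score_report(claims, {"f1", "f2"}, {"d1", "d2"}))
```

In this invented case recall is perfect while citation coverage is one in three, which is the kind of gap an averaged score would blur.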

The catch is scope. Static sandboxes trade live-web generalization for reproducibility. An agent that performs well here may still fail on tasks requiring fresh web retrieval, real-time information, or domains not covered by the benchmark corpus. DR³-Eval tells you how well an agent reasons and synthesizes within a controlled retrieval surface; it does not tell you how well that agent navigates the real open web.

For teams building or procuring DRAs, the immediate value is comparative: for the first time, you can run two agents against the same corpus and trust that performance differences reflect agent quality rather than environmental drift. The multi-dimensional scoring also gives product teams a decomposed signal; knowing whether a system fails on citation attribution versus factual accuracy points to very different remediation strategies.
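
As a rough sketch of how a decomposed signal might be used in a comparison, the snippet below diffs two agents axis by axis and reports each agent's weakest axis. The function names are hypothetical and the scores are invented placeholders, not results from the benchmark.

```python
def compare(scores_a, scores_b):
    """Per-axis difference (A minus B) over the axes both agents were scored on."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    return {axis: round(scores_a[axis] - scores_b[axis], 4) for axis in shared}


def weakest_axis(scores):
    """Name of the lowest-scoring axis, a first hint at where to focus remediation."""
    return min(scores, key=scores.get)


# Placeholder numbers chosen only to illustrate the mechanics.
agent_a = {"information_recall": 0.82, "factual_accuracy": 0.90, "citation_coverage": 0.41}
agent_b = {"information_recall": 0.74, "factual_accuracy": 0.69, "citation_coverage": 0.88}

print(compare(agent_a, agent_b))
print(weakest_axis(agent_a), weakest_axis(agent_b))
```

With these placeholder values the two agents have nearly the same average, yet one is weakest on citation coverage and the other on factual accuracy, so the fixes would differ.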

Key takeaways:

Static per-task sandbox corpora replace dynamic web environments, making DRA evaluation reproducible for the first time; distractor and noise documents simulate real retrieval complexity without changing between runs
Current DRA benchmarks conflate environment variance with agent capability; the five-dimensional scoring framework exposes failure modes (recall vs. accuracy vs. citation) that aggregate scores suppress
Teams evaluating or procuring deep research systems should run DR³-Eval as a baseline before any live-web evaluation; decomposed scores on citation coverage and factual accuracy are more actionable than a single composite metric

Source: DR³-Eval: Towards Realistic and Reproducible Deep Research Evaluation