AI Research Agents Fabricate Up to 21% of Citations: A Verifiability Crisis

ScientistOne's Chain-of-Evidence framework exposes systematic fabrication in autonomous research agents, achieving zero hallucinated references where baselines fail at rates up to 21%.

Surface-level evaluation of AI-generated research papers misses a structural failure hiding in plain sight. Autonomous research agents produce manuscripts that pass casual review, yet independent verification reveals fabricated citations, scores that cannot be reproduced, and method descriptions that diverge from the code that supposedly implements them.

The failure is not occasional. It is systematic. Across 75 papers spanning five frontier systems, every baseline exhibits at least one integrity breakdown. Hallucinated reference rates reach 21%. Score verification passes in as few as 42% of papers. Method-code alignment, the degree to which a paper's described approach matches its actual implementation, ranges from 20% to 80% across systems. These are not edge cases in otherwise reliable pipelines. They are the norm.

ScientistOne addresses this through Chain-of-Evidence (CoE), a verifiability framework built into the research process itself rather than bolted on afterward. The core design principle: every claim generated during literature review, solution discovery, or paper writing must remain traceable to its evidence source throughout the pipeline. Where existing systems treat citation and score reporting as downstream formatting tasks, CoE treats them as first-class constraints that the system must satisfy at each generation step. The analogy is version control: just as a commit history makes every code change auditable, CoE makes every factual claim auditable back to the source that licenses it.
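
One way to picture traceability as a generation-time constraint is a ledger that refuses any claim whose sources have not been registered. This is a minimal sketch, not ScientistOne's implementation: the `Evidence` and `EvidenceLedger` names, their fields, and the sample identifier are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """A source that can license a claim (hypothetical schema)."""
    source_id: str  # e.g. a DOI, an experiment log path, or a commit hash
    kind: str       # e.g. "paper", "experiment_log", "code"


@dataclass
class EvidenceLedger:
    """Hypothetical ledger: a claim is accepted only if its sources are registered."""
    evidence: dict = field(default_factory=dict)  # source_id -> Evidence
    claims: dict = field(default_factory=dict)    # claim text -> list of source_ids

    def register(self, ev: Evidence) -> None:
        self.evidence[ev.source_id] = ev

    def add_claim(self, text: str, source_ids: list) -> None:
        # The constraint is enforced when the claim is written,
        # not in a later review pass.
        if not source_ids:
            raise ValueError(f"claim has no evidence: {text!r}")
        unknown = [s for s in source_ids if s not in self.evidence]
        if unknown:
            raise ValueError(f"claim cites unregistered sources: {unknown}")
        self.claims[text] = list(source_ids)

    def trace(self, text: str) -> list:
        """Return the evidence records that back a previously accepted claim."""
        return [self.evidence[s] for s in self.claims[text]]


ledger = EvidenceLedger()
ledger.register(Evidence("log:run-001", "experiment_log"))  # made-up identifier
ledger.add_claim("The model converged on the validation split.", ["log:run-001"])
```

In this sketch, a claim citing an unregistered source raises an error at the moment it is added, which is the sense in which traceability acts as a constraint rather than an after-the-fact check.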

CoE Audit, the companion evaluation layer, applies four integrity checks uniformly across any system: score verification, specification violation detection, reference verification, and method-code alignment. Because the audit is system-agnostic, teams can run it against outputs from any autonomous research agent, not just ScientistOne.
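
To make the system-agnostic idea concrete, here is a rough sketch of two of the four checks (reference verification and score verification) run against a generic paper record. It is not the CoE Audit implementation: `PaperArtifact`, its fields, the function names, and the tolerance are assumptions, and the specification violation and method-code alignment checks are omitted because they need task-specific logic.

```python
import math
from dataclasses import dataclass


@dataclass
class PaperArtifact:
    """Hypothetical, system-agnostic view of one generated paper."""
    cited_ids: list        # identifiers of the references the paper cites
    reported_scores: dict  # metric name -> score stated in the paper
    rerun_scores: dict     # metric name -> score from re-running the code


def unverified_references(paper, known_ids):
    """Reference verification: citations that match no known record."""
    return [c for c in paper.cited_ids if c not in known_ids]


def unverified_scores(paper, rel_tol=1e-3):
    """Score verification: reported scores that a re-run does not reproduce."""
    bad = []
    for name, reported in paper.reported_scores.items():
        rerun = paper.rerun_scores.get(name)
        # A score with no re-run result is treated as unverified.
        if rerun is None or not math.isclose(reported, rerun, rel_tol=rel_tol):
            bad.append(name)
    return bad


def audit(paper, known_ids):
    """Run both checks; an empty list for a check means it passed."""
    return {
        "reference_verification": unverified_references(paper, known_ids),
        "score_verification": unverified_scores(paper),
    }
```

Because the checks only read the paper record, the same `audit` call works no matter which agent produced the paper, provided its output can be mapped onto the assumed fields.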

ScientistOne achieves zero hallucinated references (0 out of 337 checked), perfect score verification (12 of 12), and the highest method-code alignment in the benchmark (14 of 15), while matching or exceeding human expert performance across all five evaluated tasks. The system further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, reaching state-of-the-art on Parameter Golf and gold-medal performance on MLE-Bench tasks where baselines fail entirely. For teams deploying autonomous research agents in any capacity, the takeaway is direct: every AI-generated paper your pipeline produces needs a CoE-style audit layer before any downstream decision treats that output as ground truth.
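
For comparison with the baseline percentages quoted earlier, the counts above convert to rates as follows. This is plain arithmetic on the reported figures; the dictionary keys and the `percent` helper are illustrative names only.

```python
# Counts as reported above: (numerator, denominator).
reported = {
    "hallucinated_references": (0, 337),
    "score_verification": (12, 12),
    "method_code_alignment": (14, 15),
}


def percent(numerator, denominator):
    """Convert a count pair to a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


rates = {name: percent(n, d) for name, (n, d) in reported.items()}
print(rates)
```

The resulting method-code alignment rate of 93.3% sits above the 80% upper end of the baseline range, and the 0.0% hallucinated reference rate contrasts with baseline rates of up to 21%.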

We're thinking: Method-code alignment is the most consequential metric here, and the least discussed in conversations about AI research agents. Citation hallucination is visible if you check. Score fabrication is detectable if you re-run experiments. But a paper that convincingly describes a method the code does not actually implement is a different class of failure: it corrupts the scientific record in a way that only someone reading both the paper and the repository would catch. The 20-to-80 percent range across baselines on this metric suggests that the weakest systems misdescribe their own implementations far more often than not. Any organization using autonomous agents to generate technical reports, internal research, or regulatory documentation should treat CoE Audit not as an optional quality gate but as a mandatory pre-release check.

Key takeaways:

Chain-of-Evidence embeds traceability as a generation constraint throughout the research pipeline, rather than verifying claims after the fact, structurally preventing the fabrication modes that post-hoc review misses.
ScientistOne achieves 0/337 hallucinated references and 14/15 method-code alignment in an evaluation covering 75 papers; the caveat is that benchmark tasks, while diverse, remain structured research competitions rather than fully open-ended scientific inquiry.
Teams using any autonomous research agent to produce papers, technical reports, or scored evaluations should run the four CoE Audit checks (score verification, specification violation, reference verification, method-code alignment) before treating any output as trustworthy.

Source: ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence