AI agents read physics papers but do not reproduce them.

AI agents that post strong scores on coding benchmarks still fail to reproduce physics papers end-to-end, exposing a gap between understanding research and executing it. A new benchmark gives agents 30 physics tasks that require reading a paper and matching its published results, and the outcome suggests that standard coding benchmarks hide blind spots that matter for automating real scientific work.

Many assumed that end-to-end scientific reproduction (reading a paper, implementing the algorithm from scratch, and producing matching quantitative results) was just around the corner for capable coding agents. PRBench (Paper Reproduction Benchmark) ran the experiment. Across 30 expert-curated tasks spanning 11 subfields of physics, current agents fail often enough to expose a structural gap between comprehension and execution.

The benchmark design is deliberately unforgiving. Agents receive only the paper content and a task instruction, with no scaffolding, pre-written skeleton code, or pre-installed dependencies. They operate in a sandboxed execution environment and must produce quantitative outputs that match the numbers in the original publication. The tasks were contributed by domain experts from more than two institutions and cover subfields from quantum mechanics to fluid dynamics, so the task distribution reflects genuine research diversity rather than cherry-picked tractable cases. The end-to-end framing matters: the benchmark does not award partial credit for "understood the method" or "wrote syntactically correct code"; matching the numbers is required.
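To make the "matching the numbers is required" criterion concrete, here is a minimal sketch of an all-or-nothing numeric check. It is not PRBench's scoring code: the function name `matches_published`, the 5% relative tolerance, and the quantity names and values are all assumptions chosen for illustration, since the tolerances the benchmark actually uses are not stated here.

```python
import math


def matches_published(reproduced, published, rel_tol=0.05, abs_tol=1e-12):
    """Return True only if every published quantity is reproduced within tolerance.

    Hypothetical helper: rel_tol and abs_tol are illustrative defaults,
    not values taken from the benchmark.
    """
    for name, target in published.items():
        value = reproduced.get(name)
        if value is None:
            # A missing quantity counts as a failure: no partial credit.
            return False
        if not math.isclose(value, target, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
    return True


# Made-up reference values standing in for numbers reported in a paper.
published = {"ground_state_energy": -0.5, "critical_temperature": 2.269}

close_run = {"ground_state_energy": -0.499, "critical_temperature": 2.27}
partial_run = {"ground_state_energy": -0.499}  # one quantity never produced
off_run = {"ground_state_energy": -0.4, "critical_temperature": 2.269}

print(matches_published(close_run, published))
print(matches_published(partial_run, published))
print(matches_published(off_run, published))
```

Under these assumptions only `close_run` passes. A run that gets one quantity right and omits or misses the other fails outright, which is the behavior an end-to-end criterion implies.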

The limitation is real: 30 tasks is a narrow sample. Physics-specific failure modes may not transfer cleanly to chemistry, biology, or materials science, where the gap between paper description and implementation differs. The benchmark also tests agents on published papers, whose methods are, by definition, already documented. The bar for reproducing published results is lower than for generating novel ones.

For teams building science agents or evaluating coding models on research tasks, PRBench reframes the target. HumanEval and SWE-bench measure different skills from the ones reproduction demands: sustained multi-step reasoning over domain-specific notation, algorithm reconstruction from prose descriptions, and numerical validation against published results. End-to-end reproduction chains paper comprehension, from-scratch implementation, and quantitative output matching, so an agent must succeed at every stage on the same task, which is harder than passing any single component benchmark in isolation.

Current agent performance on PRBench indicates that strong coding and reasoning scores on standard benchmarks do not transfer reliably to full scientific reproduction workflows. If an agent pipeline scores well on standard coding benchmarks but has never been tested on end-to-end reproduction, its breaking points remain unknown. Teams should treat PRBench-style evaluation as a stress test before claiming research automation capability: production-readiness lives in the gap between "can generate plausible code" and "can reproduce a published result."
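The claim that chained reproduction is harder than any single component can be illustrated with a small calculation. The sketch below uses invented per-task outcomes (not PRBench data) and hypothetical stage names. It shows that when a task counts only if every stage succeeds, the end-to-end pass rate can never exceed the weakest stage's rate and is often well below it.

```python
STAGES = ["reads_paper", "writes_code", "validates_numerics"]  # hypothetical names


def pass_rates(outcomes):
    """Return (per-stage pass rates, end-to-end pass rate) for a list of tasks."""
    n = len(outcomes)
    per_stage = {s: sum(o[s] for o in outcomes) / n for s in STAGES}
    # A task passes end-to-end only if all stages pass on that same task.
    end_to_end = sum(all(o[s] for s in STAGES) for o in outcomes) / n
    return per_stage, end_to_end


# Invented outcomes for four tasks; each stage looks healthy in isolation.
outcomes = [
    {"reads_paper": True, "writes_code": True, "validates_numerics": False},
    {"reads_paper": True, "writes_code": False, "validates_numerics": True},
    {"reads_paper": True, "writes_code": True, "validates_numerics": True},
    {"reads_paper": False, "writes_code": True, "validates_numerics": True},
]

per_stage, end_to_end = pass_rates(outcomes)
print(per_stage)   # each stage passes on 3 of 4 tasks
print(end_to_end)  # only 1 of 4 tasks passes every stage
```

In this toy data every stage passes on 75% of tasks, yet only 25% of tasks pass all three, because the failures fall on different tasks. Per-stage scores alone would not reveal that.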

Source: PRBench: End-to-end Paper Reproduction in Physics Research