Same Success Rate, Completely Different Failure Modes: Web Agent Eval Is Broken

WebStep's 1,800-task benchmark reveals that agents scoring identically on task success diverge sharply on where and how they fail mid-workflow.

Terminal success scores are the standard currency for web agent evaluation: a task either passes or fails. The assumption baked into most major benchmarks is that the final outcome carries enough signal to guide improvement. WebStep tests that assumption directly, and the answer is no.

The benchmark introduces semantic state tracking: each test website exposes a deterministic semantic Markov decision process (MDP) that runs in parallel with the GUI. The agent operates on the interface as normal, while the environment records high-level states and transitions in the background without any manual annotation. Every step in the 1,800-task evaluation therefore generates a labeled state transition, not just a binary outcome at the end. The result is a trajectory, not a verdict.
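
To make the idea concrete, here is a minimal sketch of a semantic tracker that shadows GUI actions and logs one labeled transition per step. The state names, action names, and the `SemanticTracker` class are all hypothetical illustrations, not WebStep's actual interface, and the rule that unknown actions leave the state unchanged is an assumption made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical transition table for a toy housing-search site:
# (semantic state, semantic action) -> next semantic state.
TRANSITIONS = {
    ("home", "open_search"): "search",
    ("search", "apply_filter"): "filtered",
    ("filtered", "open_listing"): "listing",
    ("listing", "submit_inquiry"): "submitted",
}


@dataclass
class SemanticTracker:
    """Deterministic semantic MDP that shadows the GUI and logs transitions."""

    state: str = "home"
    log: list = field(default_factory=list)

    def observe(self, action: str) -> str:
        # Assumption: an action with no entry in the table leaves the
        # semantic state unchanged (e.g. a click that does nothing useful).
        next_state = TRANSITIONS.get((self.state, action), self.state)
        self.log.append((self.state, action, next_state))
        self.state = next_state
        return next_state


tracker = SemanticTracker()
for action in ["open_search", "open_listing", "apply_filter", "open_listing"]:
    tracker.observe(action)

# Every step produced a labeled transition, including the wasted second step.
for transition in tracker.log:
    print(transition)
```

The log, not the final state, is the evaluation artifact: the second step shows up as a no-op transition even though the run ends in the intended `listing` state.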

That trajectory structure is what makes the divergence visible. Three agents cluster at 31-33% on raw success rate, a gap small enough to pass for statistical noise. But their process-level profiles are distinct: some agents reach the right intermediate states but stall before committing, while others execute final actions cleanly but misnavigate earlier in the sequence. Exploration reach (whether the agent gets to the states it needs) and execution accuracy (whether it acts correctly once there) are separable dimensions, and terminal scoring collapses them into a single number that obscures both.
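
A small sketch shows how two agents can tie on success rate while differing on both dimensions. The definitions used here are assumptions for illustration, not WebStep's formal metrics: reach is the fraction of runs that arrive at the key pre-commit state, and execution is the fraction of those runs that then succeed. The run data is invented.

```python
def process_profile(runs):
    """runs: list of (reached_key_state, succeeded) booleans, one per task."""
    total = len(runs)
    reached = sum(1 for r, _ in runs if r)
    succeeded = sum(1 for r, s in runs if r and s)
    return {
        "success_rate": succeeded / total,
        "reach": reached / total,
        # Execution accuracy is conditional on having reached the key state.
        "execution": succeeded / reached if reached else 0.0,
    }


# Invented toy data, 10 tasks per agent (not results from the benchmark).
# Agent A gets to the right state often but frequently fails to commit.
agent_a = [(True, True)] * 4 + [(True, False)] * 4 + [(False, False)] * 2
# Agent B rarely gets there but almost always finishes when it does.
agent_b = [(True, True)] * 4 + [(True, False)] * 1 + [(False, False)] * 5

print(process_profile(agent_a))  # success 0.4, reach 0.8, execution 0.5
print(process_profile(agent_b))  # success 0.4, reach 0.5, execution 0.8
```

Under these definitions success rate is the product of reach and execution, which is why a single terminal number cannot recover either factor.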

Decomposing by skill sharpens this further. On the Housing domain, OpenAI CUA outperforms Qwen3.5 by 23.7 percentage points on commit actions, the final confirmatory steps in a workflow. On filtering tasks within the same domain, it underperforms Qwen3.5 by 15.6 points. Same website, same aggregate score range, opposite rankings depending on which skill you measure. A team trying to improve CUA based on task-level Housing scores would have no idea whether to fix filtering or commit behavior. WebStep pinpoints the answer.
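
The ranking flip is easy to reproduce on toy data. The sketch below assumes each logged step carries a skill label and a correctness flag (a hypothetical record format), and the numbers are invented so that both agents have the same overall step accuracy; they are not the CUA or Qwen3.5 results.

```python
from collections import defaultdict


def skill_accuracy(steps):
    """steps: iterable of (skill, ok) pairs -> {skill: fraction correct}."""
    counts = defaultdict(lambda: [0, 0])  # skill -> [correct, total]
    for skill, ok in steps:
        counts[skill][0] += int(ok)
        counts[skill][1] += 1
    return {skill: ok / total for skill, (ok, total) in counts.items()}


# Invented toy logs: both agents get 12 of 20 steps right overall.
agent_x = ([("filter", True)] * 3 + [("filter", False)] * 7
           + [("commit", True)] * 9 + [("commit", False)] * 1)
agent_y = ([("filter", True)] * 6 + [("filter", False)] * 4
           + [("commit", True)] * 6 + [("commit", False)] * 4)

acc_x, acc_y = skill_accuracy(agent_x), skill_accuracy(agent_y)

# Gap in percentage points, agent X minus agent Y, per skill.
gaps = {skill: round(100 * (acc_x[skill] - acc_y[skill]), 1) for skill in acc_x}
print(gaps)  # {'filter': -30.0, 'commit': 30.0}
```

The aggregate is identical, yet agent X trails by 30 points on filtering and leads by 30 on commits, so the useful fix differs per agent.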

Bifurcation analysis adds a third diagnostic layer: it identifies the single step where a task trajectory diverges from success, the decisive error that loses the run. Those errors are agent-specific rather than shared across systems, which means benchmark-level aggregates are averaging over structurally different failure modes. Difficulty scaling confirms the pattern. On easy tasks, success rates converge. As tasks grow harder and exploration depth increases, the gaps widen sharply, which means leaderboard numbers are least informative exactly where agents are least challenged, and the differences that matter only surface on harder tasks.
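
One way to operationalize a bifurcation point, offered as an assumption rather than the paper's definition, is the first transition that moves the agent from a state where the goal is still reachable to a state where it is not. The graph and state names below are hypothetical.

```python
from collections import deque

# Hypothetical semantic graph: state -> successor states.
GRAPH = {
    "home": ["search", "help"],
    "search": ["filtered", "wrong_city"],
    "filtered": ["listing"],
    "listing": ["submitted"],
    "wrong_city": ["wrong_listing"],
    "wrong_listing": [],
    "help": ["home"],
    "submitted": [],
}


def viable_states(graph, goal):
    """States from which the goal is still reachable (backward BFS)."""
    parents = {state: [] for state in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            parents[dst].append(src)
    seen, queue = {goal}, deque([goal])
    while queue:
        for parent in parents[queue.popleft()]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def bifurcation_step(trajectory, graph, goal):
    """Index of the first transition that leaves the viable set, else None."""
    viable = viable_states(graph, goal)
    for i in range(len(trajectory) - 1):
        if trajectory[i] in viable and trajectory[i + 1] not in viable:
            return i
    return None


failed_run = ["home", "help", "home", "search", "wrong_city", "wrong_listing"]
step = bifurcation_step(failed_run, GRAPH, "submitted")
print(step, failed_run[step], "->", failed_run[step + 1])  # 3 search -> wrong_city
```

The detour through `help` is wasteful but recoverable, so it is not flagged; only the move into `wrong_city` loses the run. In an environment where every state can be undone, this reachability test would never fire, and a different criterion would be needed.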

WebStep scores all 1,800 task instances, with controlled difficulty, along these three diagnostic dimensions. Scoring is automatic throughout and requires no human annotation. For teams building or selecting web agents for production workflows, the takeaway is direct: a single task-success number cannot tell you whether your agent fails because it cannot find the right state or because it cannot execute once there, and those two failure modes require completely different fixes.

We're thinking: The reliability problem here is probably larger than the benchmark gap suggests. If agents can reach correct terminal states via wrong intermediate paths, then success-rate numbers are not just incomplete, they are actively misleading: they credit agents for outcomes that would not survive in production, where intermediate state correctness matters for auditability, error recovery, and chaining into downstream tasks. WebStep's semantic MDP framing is the right architecture for exposing this, but the deeper implication is that every web agent deployment decision made on terminal-success benchmarks alone carries an unquantified reliability discount. The 31-33% cluster does not mean these agents are equivalent. It means the measurement was too coarse to tell them apart.

Key takeaways:

Semantic state tracking decouples exploration reach from execution accuracy, exposing skill-level rankings that aggregate success rates hide entirely.
Three agents clustering at 31-33% task success diverge by up to 23.7 percentage points on individual skills within the same domain; the benchmark covers 1,800 tasks with automatic annotation, though coverage is currently limited to the specific website environments included.
Teams evaluating or fine-tuning web agents should run process-level diagnostics before committing to architectural changes: the failure mode (navigation vs. execution) determines the intervention, and task-success scores alone cannot distinguish them.

Source: WebStep: Process-Level Evaluation of Web Agents with Semantic State Tracking