The Field's Go-To GUI Agent Dataset Actively Breaks Fine-Tuning

ProCUA-SFT shows AgentNet causes negative transfer when fine-tuning computer-use agents (CUAs), while 3.1M synthetic steps lift UI-TARS 7B on OSWorld from 26.3% to 45.0%.

The assumption baked into most GUI agent training pipelines is simple: more human trajectory data is better. AgentNet, with 22.5K human-collected desktop trajectories, is the largest public resource for computer-use agents. Yet fine-tuning UI-TARS 7B on it drops the model's OSWorld success rate from 26.3% to between 8% and 10%. That is not a marginal regression. That is the model forgetting how to do its job.

The failure mode here is negative transfer, and it is structural, not incidental. Human trajectories collected across heterogeneous desktop sessions carry implicit context assumptions, action distributions, and screenshot layouts that do not match the inference-time context a model actually sees. When a model trains on those trajectories, it updates toward behavior patterns that are locally coherent in the training data but misaligned with the step-prefix format used at evaluation. The mismatch is invisible at the dataset level and only surfaces when you measure downstream task completion.

ProCUA-SFT addresses this by building the dataset around the inference context from the start. A single VLM, Kimi-K2.5, handles goal generation, precondition checking, and trajectory execution, which eliminates the planner-actor capability gap that plagues multi-model pipelines where the model generating instructions differs in capability from the model executing them. Each of the 93K synthetic trajectories is then expanded into step-prefix samples: every training example reproduces exactly the screenshot-plus-action-history context the model will see at inference time. The result is 3.1M step-level samples that are format-consistent by construction, not by post-hoc filtering.
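The step-prefix expansion can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Step` and `StepPrefixSample` types, the `expand_trajectory` function, and the `max_history` parameter are all hypothetical names, and the sketch assumes one training sample per trajectory step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    screenshot: str  # path or ID of the screenshot observed before acting (assumed representation)
    action: str      # the action taken at this step, serialized as text (assumed representation)


@dataclass
class StepPrefixSample:
    goal: str
    history: List[str]  # actions taken before this step
    screenshot: str     # observation at this step
    target: str         # action the model should predict


def expand_trajectory(
    goal: str,
    steps: List[Step],
    max_history: Optional[int] = None,
) -> List[StepPrefixSample]:
    """Turn one trajectory into one sample per step.

    Each sample pairs the current screenshot with the actions that
    preceded it, mirroring the context available at inference time.
    """
    samples = []
    for i, step in enumerate(steps):
        history = [s.action for s in steps[:i]]
        if max_history is not None:
            # Guard max_history == 0 explicitly: history[-0:] would keep everything.
            history = history[-max_history:] if max_history > 0 else []
        samples.append(StepPrefixSample(goal, history, step.screenshot, step.action))
    return samples
```

Under that one-sample-per-step assumption, 93K trajectories yielding 3.1M samples works out to roughly 33 steps per trajectory on average.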

The grounding strategy matters as much as the format alignment. Rather than generating tasks from scratch against blank desktop states, the pipeline seeds live desktop environments with real-world content: 912 spreadsheets from SpreadsheetBench, roughly 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configurations spanning 2,484 application combinations. Before any rollout begins, a binary precondition check verifies the task is actually feasible in the current desktop state. This step alone removes a major source of noise in synthetic trajectory generation, where models frequently attempt tasks that cannot succeed given the current file or application state.
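A feasibility gate of this kind is simple to express. The sketch below is an assumption about the shape of such a gate, not the pipeline's actual interface: `gated_rollouts`, `is_feasible`, and `rollout` are hypothetical names, and in the real system both the check and the rollout would be VLM calls against a live desktop.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

State = TypeVar("State")  # whatever represents the seeded desktop state
Traj = TypeVar("Traj")    # whatever a rollout returns


def gated_rollouts(
    tasks: Iterable[Tuple[str, State]],
    is_feasible: Callable[[str, State], bool],
    rollout: Callable[[str, State], Traj],
) -> Tuple[List[Traj], List[str]]:
    """Run a rollout only for tasks that pass a binary precondition check.

    Returns the collected trajectories and the goals that were skipped,
    so infeasible tasks are logged instead of silently entering training.
    """
    kept: List[Traj] = []
    skipped: List[str] = []
    for goal, state in tasks:
        if not is_feasible(goal, state):
            skipped.append(goal)
            continue
        kept.append(rollout(goal, state))
    return kept, skipped
```

Returning the skipped goals alongside the trajectories makes the rejection rate visible, which is a cheap signal of how well goal generation matches the seeded content.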

Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld. That is an 18.7 percentage-point gain over the base model and 35 to 37 points above what the same base model achieves after AgentNet fine-tuning. A subset of ProCUA was incorporated into training for Nvidia's Nemotron 3 Nano Omni model. For teams building or fine-tuning computer-use agents, the takeaway is direct: the source and format of your trajectory data matter more than the volume, and the largest public dataset available is likely to hurt rather than help at current model scales.
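The comparisons follow from simple arithmetic on the reported scores. The snippet below only restates the figures quoted in this article; the variable names are ours.

```python
# OSWorld success rates (percent) as reported in this article.
base = 26.3
procua = 45.0
agentnet_low, agentnet_high = 8.0, 10.0  # reported range after AgentNet fine-tuning

gain_over_base = round(procua - base, 1)               # percentage points over the base model
gap_vs_agentnet_min = round(procua - agentnet_high, 1)  # smallest gap to the AgentNet-tuned model
gap_vs_agentnet_max = round(procua - agentnet_low, 1)   # largest gap to the AgentNet-tuned model

print(gain_over_base, gap_vs_agentnet_min, gap_vs_agentnet_max)  # 18.7 35.0 37.0
```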

We're thinking: We expect the precondition-checking step to be underappreciated in coverage of this paper. The headline number is the OSWorld jump, but the more durable contribution may be the feasibility verification gate before rollout. Synthetic trajectory pipelines fail quietly: the model attempts tasks that are structurally impossible given the desktop state, the trajectory looks plausible in the log, and the noise enters training without any signal that something went wrong. Treating precondition checking as a required infrastructure component, not an optional quality filter, reframes how teams should think about synthetic data generation for agentic tasks in general, not just GUI agents.

Key takeaways:

ProCUA-SFT aligns training data to inference-time context by expanding each trajectory into step-prefix samples and using a single VLM for goal generation, feasibility checking, and execution, eliminating format mismatch and planner-actor capability gaps.
Fine-tuning UI-TARS 7B for one epoch on 3.1M samples reaches 45.0% on OSWorld, up from 26.3% for the base model and well above the 8-10% the same model scores after AgentNet fine-tuning; the caveat is that results are on a single benchmark and a single base model family.
Teams fine-tuning computer-use agents should audit their trajectory data for inference-time format alignment before scaling dataset size, and should treat AgentNet as a negative-transfer risk rather than a free performance boost.

Source: ProCUA-SFT Technical Report