Training in the Deployment Harness Closes the Benchmark-Production Gap

Cursor trains its Composer 2 coding model inside the actual deployment environment rather than on isolated benchmarks, aiming to close the usual gap between test performance and real-world results. Because reinforcement learning runs on the same tools and structure that deployed users see, the model is rewarded for solving the problems it will meet in production instead of optimizing for curated datasets. That is a notable shift in how coding AI gets trained.

Most coding models train on curated problem sets and are evaluated on benchmarks such as SWE-bench. Cursor's Composer 2 takes a different path: its training infrastructure runs inside the same harness the deployed model uses, with equivalent tools and identical structure. This addresses the benchmark-to-production gap at the training level, rather than patching it afterward with fine-tuning.

The training pipeline splits into two phases. First, continued pretraining sharpens the model's underlying coding knowledge and latent problem-solving capacity. Then large-scale reinforcement learning (RL) runs on top, using environments that mirror real user issues in structure and complexity instead of toy problems. The RL objective targets three things at once: stronger multi-step reasoning, accurate execution across long action sequences, and coherence on long-horizon tasks where intermediate errors compound. The reward signal comes from an environment that matches the deployment context closely enough that what the model learns in training is what it needs in production.
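To make the harness-alignment idea concrete, here is a minimal sketch in which one tool registry serves both the RL rollout and deployment. It is an illustration under assumptions, not the report's implementation: the names Harness, run_episode, scripted_policy, and reward_fn, the two toy tools, and the pass/fail reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a tool maps a string argument to a string observation.
Tool = Callable[[str], str]
Action = Tuple[str, str]  # (tool name, argument)
Policy = Callable[[List[str]], Optional[Action]]


@dataclass
class Harness:
    """A single tool registry, meant to be shared by training and deployment."""
    tools: Dict[str, Tool]

    def step(self, tool_name: str, arg: str) -> str:
        # Unknown tools yield an error observation instead of raising,
        # so the policy sees the failure the way a deployed agent would.
        if tool_name not in self.tools:
            return f"error: unknown tool {tool_name!r}"
        return self.tools[tool_name](arg)


def run_episode(
    harness: Harness,
    policy: Policy,
    reward_fn: Callable[[List[str]], float],
    max_steps: int = 8,
) -> Tuple[List[str], float]:
    """Roll out a policy through the harness and score the whole trajectory."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = policy(observations)
        if action is None:  # the policy decides it is done
            break
        tool_name, arg = action
        observations.append(harness.step(tool_name, arg))
    return observations, reward_fn(observations)


# Toy setup (assumed for illustration): an in-memory repo and two tools.
FILES = {"main.py": "def add(a, b): return a + b"}
harness = Harness(tools={
    "read_file": lambda path: FILES.get(path, "error: no such file"),
    "run_tests": lambda _arg: "tests passed",
})


def scripted_policy(observations: List[str]) -> Optional[Action]:
    # Stand-in for a model: read the file, run the tests, then stop.
    plan = [("read_file", "main.py"), ("run_tests", "")]
    return plan[len(observations)] if len(observations) < len(plan) else None


def reward_fn(observations: List[str]) -> float:
    # Outcome reward: 1.0 only if the final observation reports passing tests.
    return 1.0 if observations and observations[-1] == "tests passed" else 0.0


observations, reward = run_episode(harness, scripted_policy, reward_fn)
```

The point of the design is that run_episode and a production request handler would both call the same Harness.step, so any behavior the reward reinforces is behavior against the tools users actually hit.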

The limitation is visibility: the abstract cuts off before the evaluation numbers appear, so benchmark comparisons are not available. What is legible is the design philosophy. For teams building or evaluating coding agents, the harness-alignment principle offers a transferable insight: RL trained against a proxy environment will optimize for that proxy, and the gap between proxy and production is where performance leaks. Composer 2 bets that closing the environment gap is worth the infrastructure cost of replicating the full deployment harness in training.

Key takeaways:

RL post-training runs inside the actual deployment harness with real tool access and structure, making the training environment the production environment rather than an approximation.
Splitting pretraining (knowledge/latent ability) from RL (end-to-end execution, coherence) suggests the two objectives require fundamentally different training regimes and should not be collapsed into one.
Teams building coding agents should audit the fidelity gap between their RL training environment and their deployment harness before investing in scale; environment mismatch may explain more variance in production performance than model size does.
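A first pass at the fidelity audit in the last takeaway could be as simple as diffing the tool surface of the two environments. The sketch below is an assumption-laden illustration, not a method from the report: audit_tool_gap, the tool names, the parameter sets, and the "fidelity" score (matching tools divided by all tools seen in either environment) are hypothetical.

```python
from typing import Dict, Set


def audit_tool_gap(
    training_tools: Dict[str, Set[str]],
    deployment_tools: Dict[str, Set[str]],
) -> Dict[str, object]:
    """Compare tool names and parameter names between two environments.

    Each input maps a tool name to the set of parameter names it accepts.
    """
    train_names = set(training_tools)
    deploy_names = set(deployment_tools)
    shared = train_names & deploy_names
    # Tools present in both environments but with different parameters.
    mismatched = sorted(
        name for name in shared
        if training_tools[name] != deployment_tools[name]
    )
    union = train_names | deploy_names
    matching = len(shared) - len(mismatched)
    return {
        "missing_in_training": sorted(deploy_names - train_names),
        "missing_in_deployment": sorted(train_names - deploy_names),
        "signature_mismatch": mismatched,
        # Crude illustrative score: share of tools that are identical in both.
        "fidelity": matching / len(union) if union else 1.0,
    }


# Hypothetical tool inventories for a proxy training env and a real harness.
training = {
    "read_file": {"path"},
    "run_tests": {"target"},
    "grep": {"pattern"},
}
deployment = {
    "read_file": {"path"},
    "run_tests": {"target", "timeout"},
    "edit_file": {"path", "patch"},
}

report = audit_tool_gap(training, deployment)
print(report)
```

A tool-surface diff only catches the most visible mismatches; differences in tool behavior, latency, or context formatting would need their own checks.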

Source: Composer 2 Technical Report