On-Policy Distillation Makes Models More Accurate and More Overconfident — Simultaneously

Standard post-training methods can boost accuracy while making a model overconfident, because the model learns from information that is unavailable at deployment. A new approach separates accuracy training from confidence calibration, keeping the accuracy gains while repairing the confidence estimates that downstream systems such as AI agents and retrieval pipelines depend on.

On-policy distillation (OPD) has become a standard recipe for post-training language models: the student generates its own outputs, a teacher scores them, and the student trains on the result. Accuracy climbs, but the common assumption that a more capable model is also a better-calibrated one does not hold.

The failure has a specific structural cause. During training, the teacher scores responses with access to privileged context: ground-truth labels, reference solutions, or other signals unavailable at deployment. The student learns to predict response success conditioned on that privileged information. At deployment, the privileged context is gone. The student's confidence still reflects teacher-time conditions, not deployment-time uncertainty. The result is systematic overconfidence: the model's stated probability of being correct far exceeds its actual accuracy. This is an information mismatch baked into the training objective, not a model size or data volume problem. The paper formalizes this as a Scaling Law of Miscalibration (more OPD training, more overconfidence) and proves theoretically that teacher-conditioned success is generally not a valid target for deployment-time confidence, and that helpful privileged context induces entropy collapse (the model's probability distribution collapses toward near-certain predictions) and a systematic optimism bias.

CaOP (Calibration-aware On-Policy distillation) addresses this by decoupling the accuracy signal from the calibration signal during training. Rather than letting the teacher's privileged scoring bleed into confidence targets, CaOP constructs calibration supervision using only the information available at deployment time, keeping the accuracy gains from OPD while correcting the confidence collapse. The limitation is scope: experiments focus on language model post-training benchmarks, and how far this generalizes to multimodal or code-generation settings with different privileged-context structures remains untested.

This matters immediately for teams deploying models wherever confidence scores drive downstream decisions: RAG (Retrieval-Augmented Generation) pipelines that threshold on model certainty, agent systems that use confidence to decide when to escalate, and other risk-sensitive applications. A model that has been through OPD post-training may be more accurate on your benchmark and dramatically less trustworthy in its uncertainty estimates.
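To make the risk to confidence-gated systems concrete, here is a minimal Python sketch of a gate audit. It is not from the paper: the names `GateReport` and `audit_confidence_gate` are hypothetical, and it assumes you have a labeled held-out set with one stated confidence and one correctness flag per answer. It reports how many answers clear a confidence threshold and how accurate those accepted answers actually are.

```python
from typing import NamedTuple, Optional, Sequence


class GateReport(NamedTuple):
    """Summary of what a confidence threshold lets through (hypothetical helper type)."""
    threshold: float
    accepted_fraction: float            # share of answers with confidence >= threshold
    accepted_accuracy: Optional[float]  # accuracy among accepted answers, None if none accepted


def audit_confidence_gate(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> GateReport:
    """Measure how trustworthy the answers that pass a confidence gate really are."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("need at least one example")

    # Keep the correctness flags of the answers the gate would accept.
    accepted = [bool(ok) for conf, ok in zip(confidences, correct) if conf >= threshold]

    accepted_fraction = len(accepted) / len(confidences)
    accepted_accuracy = sum(accepted) / len(accepted) if accepted else None
    return GateReport(threshold, accepted_fraction, accepted_accuracy)


if __name__ == "__main__":
    # Synthetic, made-up numbers for illustration only.
    confidences = [0.99, 0.97, 0.95, 0.92, 0.60, 0.55]
    correct = [True, True, False, False, True, False]
    print(audit_confidence_gate(confidences, correct, threshold=0.9))
```

If the accepted answers are much less accurate than the threshold implies, the gate is passing answers it was meant to stop. Running the same audit on a checkpoint before and after OPD would show whether the threshold still means what the pipeline assumes.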

Key takeaways:

OPD training creates an information mismatch: teacher supervision uses privileged context unavailable at deployment, producing entropy collapse and systematic overconfidence independent of accuracy gains.
This is a structural property of the training objective, not a scale or data artifact; more OPD training reliably worsens calibration even as it improves accuracy.
Teams using OPD post-trained models in confidence-gated pipelines should audit calibration explicitly; ECE (Expected Calibration Error) on held-out data before and after OPD is the minimum diagnostic to run.
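The ECE diagnostic in the last takeaway can be sketched in a few lines of Python with NumPy. This is the common equal-width-binning estimator, not necessarily the exact variant used in the paper. The function name `expected_calibration_error`, the choice of 10 bins, and the example numbers are assumptions for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted average of |mean confidence - accuracy| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if confidences.shape != correct.shape or confidences.size == 0:
        raise ValueError("need two non-empty arrays of the same shape")

    # Interior bin edges, e.g. 0.1, 0.2, ..., 0.9 for 10 bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each confidence to a bin index in [0, n_bins - 1] (right-inclusive bins).
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap  # weight the gap by the share of examples in the bin
    return float(ece)


if __name__ == "__main__":
    # Synthetic, made-up held-out results for illustration only.
    correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
    conf_before = [0.70, 0.80, 0.60, 0.50, 0.70, 0.40, 0.80, 0.70, 0.50, 0.60]
    conf_after = [0.99, 0.98, 0.99, 0.97, 0.99, 0.96, 0.99, 0.98, 0.97, 0.99]
    print("ECE before:", expected_calibration_error(conf_before, correct))
    print("ECE after: ", expected_calibration_error(conf_after, correct))
```

In a real audit, the two confidence arrays would come from the same held-out prompts scored by the pre-OPD and post-OPD checkpoints. ECE is sensitive to the bin count and the sample size, so keep both fixed across the comparison.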

Source: The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation