On-Policy Distillation Hurts When the Teacher's Context Is Wrong

A training-free diagnostic shows distillation guidance aligns with ideal gradients on incorrect rollouts but degrades on correct ones, breaking the 'distill from best model' default.

The default recipe for training reasoning models with on-policy distillation assumes more teacher signal is better signal. That assumption is wrong in a specific, testable way: when the student already answers correctly, the teacher's guidance actively misaligns with the ideal parameter update, injecting noise precisely where the model needs none.

The diagnostic framework introduced here operates without any additional training runs. Instead of measuring aggregate benchmark shifts after the fact, it derives what an ideal gradient would look like for each token: the parameter update that maximally increases the student's probability of success on that specific position. A scalable targeted-rollout algorithm then estimates this ideal gradient efficiently across long chains of reasoning steps, even when intermediate thoughts run hundreds of tokens deep. The alignment score is the cosine similarity between this ideal gradient and whatever distillation gradient a given teacher-context configuration actually produces. Think of it as a per-token quality meter for your distillation signal, one that runs before you commit to a training run.

The central finding holds across self-distillation and external teacher settings alike: distillation guidance shows substantially higher alignment with the ideal on incorrect rollouts than on correct ones. On correct rollouts, the teacher's signal trends toward noise. Beyond that, no single teacher-context configuration dominates across tasks and student capacities. The optimal pairing depends jointly on the student's capability and the target task, which means a configuration that works on one problem set can actively harm performance on another. For teams fine-tuning reasoning models with on-policy distillation, the takeaway is direct: run per-token alignment diagnostics before committing a teacher configuration, because the correct-rollout degradation is predictable and avoidable.

We're thinking: We find the correct-rollout finding more consequential than the headline framing suggests. The standard justification for on-policy distillation is that dense per-token supervision accelerates learning. That justification is only half-true: it holds on the tokens where the student is struggling, and it reverses on the tokens where the student is already succeeding. This means teams using distillation throughout training, without filtering by rollout outcome, are likely running a mixed-signal regime where gains on hard examples are partially cancelled by degradation on easy ones. The diagnostic framework is the more durable contribution here, because it gives teams a way to detect this before the training bill arrives, not after.

Key takeaways:

Gradient alignment scores, computed as cosine similarity between an ideal per-token update and the actual distillation gradient, reveal that teacher signal quality is not uniform: it is systematically higher on incorrect rollouts and systematically noisier on correct ones.
The finding holds across self-distillation and external teacher configurations; no universal teacher-context pairing outperforms across tasks, and the optimal choice shifts with student capacity and task domain, meaning single-configuration distillation pipelines carry hidden performance risk.
Teams training reasoning models with on-policy distillation should apply the targeted-rollout alignment diagnostic to audit their teacher configuration per task before training, and should consider filtering or down-weighting distillation signal on correct rollouts where gradient alignment is measurably low.

Source: Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why