On-Policy Distillation Breaks at the Prefix, Not the Token

Trajectory-Refined Distillation names the exact structural failure in on-policy distillation and fixes it at the trajectory level, not the token level.

On-policy distillation is supposed to be the clean answer to distribution shift in LLM training: let the student generate its own rollouts, then apply dense teacher supervision at every token. The assumption has been that when quality degrades, the fix lives at the token level, in loss truncation, reweighting, or filtering. That assumption is wrong.

The structural failure happens earlier, at the prefix. When a student's rollout diverges from any coherent reasoning path, the teacher faces an impossible task: it must assign probabilities to tokens that follow a premise it would never have generated. The result is a bimodal distribution, the teacher simultaneously hedging across two incompatible continuations. Gradients from those competing modes cancel or fragment. Token-level interventions applied downstream of this collapse cannot undo it, because they treat symptoms rather than the source.

Trajectory-Refined Distillation (TRD) moves the correction upstream. Instead of reweighting losses on a broken rollout, it revises the rollout itself before distillation begins. Using the teacher as a guide, TRD replaces problematic prefixes with corrected alternatives that stay within the student's on-policy support, meaning the revision is not an out-of-distribution oracle trace but a reachable trajectory the student can actually learn from. This is a meaningful structural distinction. Prior methods kept the broken scaffold and adjusted how much the student trusted each rung. TRD replaces the rungs. The method also applies to on-policy self-distillation (OPSD), where the teacher is the student model conditioned on privileged information, a parameter-sharing setup that makes TRD accessible without a separate larger model.

There is a secondary benefit worth naming separately. Even when the student's original rollout is correct, TRD exposes the student to alternative valid derivations under teacher guidance. That broadens reasoning coverage without requiring additional data collection. Correct-but-narrow rollouts are a known ceiling on Pass@K diversity; TRD addresses it as a byproduct of the correction mechanism rather than as a separate objective.

Across benchmarks and base models at multiple scales, TRD consistently lifts both single-attempt accuracy and reasoning coverage over prior on-policy distillation baselines. The gains hold across model families, not just at one scale point, which rules out the explanation that TRD is simply better matched to a particular architecture. For teams running post-training distillation pipelines on reasoning models, the takeaway is direct: if your student quality plateaus or degrades under on-policy training, the failure is likely happening at the prefix, and token-level loss surgery will not fix it.

We're thinking: We find the prefix failure diagnosis more valuable than the method itself, and that is not a dismissal of TRD. The field has accumulated a collection of token-level interventions, each improving on the last, without anyone naming why the ceiling kept appearing. Bimodal teacher mixtures at diverged prefixes is a precise, testable claim about mechanism, not a vague appeal to distribution mismatch. The implication is that teams auditing distillation quality should inspect the teacher's output distribution at prefix boundaries, not just aggregate loss curves. If the teacher is hedging across incompatible continuations at a given prefix, no downstream reweighting scheme recovers clean gradients. That diagnostic framing changes what you instrument, not just what hyperparameter you tune.

Key takeaways:

Token-level loss interventions fail because on-policy distillation breaks at the prefix, producing bimodal teacher distributions and fragmented gradients that downstream reweighting cannot repair; TRD corrects the rollout before distillation rather than adjusting loss weights after.
TRD outperforms prior on-policy distillation baselines on single-attempt accuracy and reasoning coverage across multiple model scales and benchmark families, with the gains holding in both standard OPD and the parameter-sharing OPSD variant; the paper does not yet report wall-clock overhead for the trajectory revision step, which matters for production training budgets.
Teams running on-policy distillation for reasoning models should audit teacher output distributions at prefix boundaries, not just token-level loss curves, and consider TRD or trajectory-level correction as the intervention point when student quality plateaus.

Source: Trajectory-Refined Distillation