When the teacher cheats, the student memorizes instead of learning

When an AI model learns from a "teacher" that sees the right answer, it memorizes shortcuts instead of learning generalizable skills, a hidden failure mode of dense supervision. The work summarized here reports that mixing in sparse feedback from verifiable outcomes prevents this collapse and enables stable long-term training where pure dense signals fail.

Dense supervision from a privileged teacher sounds strictly better than sparse reward signals. However, systematic experiments on on-policy self-distillation (OPSD), where the same model acts as both teacher and student with the teacher seeing reference answers, show that feeding the student exclusively on these rich signals causes information leakage and training instability that compounds over long runs.

The mechanism is structural. In OPSD, the teacher sees the ground-truth answer whenever it produces supervision during training. Its outputs are dense and fine-grained, but they are conditioned on information the student will never see at deployment. The student therefore learns to imitate targets it cannot reconstruct from its own inputs at test time: a form of shortcut learning baked into the supervision itself. Pure RLVR (Reinforcement Learning with Verifiable Rewards), by contrast, provides only sparse binary feedback from verifiable outcomes, forcing the model to generalize. Self-Distilled RLVR identifies an optimal mixing ratio between these two signal types: enough dense teacher guidance to accelerate learning, and enough sparse verifiable reward to keep the student from collapsing onto leaked information.
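To make the idea of mixing the two signal types concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's objective: the linear interpolation, the weight name `alpha`, the choice of KL(teacher || student) as the dense term, and the REINFORCE-style sparse term are all hypothetical choices made for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dense_distill_loss(teacher_logits, student_logits):
    """Dense signal: mean per-token KL(teacher || student) over one sequence.

    Assumption: the teacher logits come from the same model conditioned on
    the reference answer. In an autodiff framework they would be treated as
    constants (no gradient flows through the teacher).
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

def sparse_rl_loss(student_logits, sampled_tokens, reward, baseline=0.0):
    """Sparse signal: REINFORCE-style surrogate for one sampled sequence.

    `reward` is the verifiable outcome (e.g. 1.0 if the final answer checks
    out, else 0.0). The loss is -(reward - baseline) * log pi(sequence).
    """
    log_prob = sum(
        math.log(softmax(logits)[tok])
        for logits, tok in zip(student_logits, sampled_tokens)
    )
    return -(reward - baseline) * log_prob

def mixed_loss(teacher_logits, student_logits, sampled_tokens, reward,
               alpha, baseline=0.0):
    """Hypothetical blend: alpha * dense + (1 - alpha) * sparse.

    alpha = 1.0 is pure distillation; alpha = 0.0 is pure verifiable reward.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    dense = dense_distill_loss(teacher_logits, student_logits)
    sparse = sparse_rl_loss(student_logits, sampled_tokens, reward, baseline)
    return alpha * dense + (1.0 - alpha) * sparse
```

Each logits argument is a list of per-position lists, one entry per vocabulary item. Setting `alpha` to 1.0 or 0.0 recovers the two pure regimes, which makes it easy to compare a blended run against both extremes. The value of `alpha` that works in practice is exactly the detail this summary cannot confirm.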

One caveat matters: this summary works from a truncated abstract, so the specific optimal mixing ratio, the benchmark results, and the exact instability dynamics are not confirmed here. Check the full paper before adopting specific hyperparameter choices. The diagnostic principle still holds: if your self-distillation setup gives the teacher privileged access to labels, pure dense supervision is not a safe default for long training runs. Teams using OPSD pipelines should audit whether their teacher signal is conditioned on information unavailable at inference, and should consider blending in verifiable outcome rewards as a stabilizer.
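The audit described above can start as a simple comparison of what each role is given. The sketch below assumes, purely for illustration, that the teacher and student inputs are available as dictionaries of named fields. The function name `privileged_fields` and the field names in the usage note are hypothetical and do not come from the paper or any specific library.

```python
def privileged_fields(teacher_context, student_context):
    """Return the names of context fields that give the teacher information
    the student lacks: fields missing from the student context, or present
    with a different value."""
    leaked = []
    for key, value in teacher_context.items():
        if key not in student_context or student_context[key] != value:
            leaked.append(key)
    return sorted(leaked)
```

For example, comparing a teacher context of `{"question": "2+2?", "reference_answer": "4"}` with a student context of `{"question": "2+2?"}` flags `reference_answer`. Any flagged field is a candidate source of leakage and a reason to consider adding a verifiable reward term.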

Key takeaways:

Privileged teacher signals in OPSD create systematic information leakage; the student optimizes for out-of-distribution supervision, producing instability over long training horizons
Pure dense distillation and pure sparse RL fail in opposite ways; the best training signal is a mixture of the two, not either extreme
Teams running self-distillation fine-tuning should check whether teacher outputs are conditioned on reference labels the student never sees at inference, and introduce a verifiable reward component to stabilize long-run training

Source: Self-Distilled RLVR