On-Policy Distillation Without Logit Access: +28.64% on Math

OmniOPD removes the white-box teacher requirement from on-policy distillation, using chunk-level semantic verification to match or beat open-weight OPD baselines.

On-policy distillation assumes you can read the teacher's mind, token by token. That assumption locks out every proprietary model behind an API, which happens to include the strongest teachers available today.

The standard approach feeds a student model its own generated trajectories and corrects it using the teacher's token-level logit distribution. Dense supervision, tight feedback. The problem is structural: logit access requires a white-box teacher, and even when you have one, the signal is brittle. Token-level logit matching depends on the teacher and student sharing a narrow overlap of plausible next tokens. When that overlap is thin, the correction amplifies noise rather than signal, producing repetition loops and degenerate outputs.
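To make the overlap problem concrete, here is a minimal sketch. It assumes the per-token objective is a reverse KL from student to teacher, which is one common formulation of on-policy distillation; the summary above does not specify the exact loss. The function names and the toy probabilities are invented for illustration.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one next-token distribution (dict: token -> prob)."""
    total = 0.0
    for tok, p in student.items():
        if p > 0.0:
            # Clamp tokens the teacher gives (almost) no mass to, so log stays finite.
            q = max(teacher.get(tok, 0.0), eps)
            total += p * math.log(p / q)
    return total

def top_k_overlap(student, teacher, k=3):
    """Fraction of the student's top-k tokens that also appear in the teacher's top-k."""
    def top(dist):
        return set(sorted(dist, key=dist.get, reverse=True)[:k])
    return len(top(student) & top(teacher)) / k

# Toy next-token distributions (made-up numbers, illustration only).
student = {"so": 0.50, "thus": 0.30, "hence": 0.15, "x": 0.05}
teacher_close = {"so": 0.45, "thus": 0.35, "hence": 0.15, "x": 0.05}
teacher_far = {"let": 0.60, "we": 0.30, "so": 0.06, "x": 0.04}

kl_close = reverse_kl(student, teacher_close)
kl_far = reverse_kl(student, teacher_far)

print("overlap (close):", top_k_overlap(student, teacher_close), "KL:", kl_close)
print("overlap (far):  ", top_k_overlap(student, teacher_far), "KL:", kl_far)
```

With a thin overlap, most of the student's probability mass lands on tokens the teacher barely supports, so the per-token penalty is dominated by the clamped terms rather than by anything the teacher meaningfully prefers.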

OmniOPD replaces logit matching with a different kind of verification. Instead of asking "what probability does the teacher assign to each next token," it asks "does the teacher approve of this chunk of generated text." The mechanism runs Monte Carlo rollouts over multi-token chunks and scores them using continuous semantic similarity, producing a preference signal that doesn't require logit access at all. Think of it as replacing a microscope with a panel of judges: less resolution per token, but far more reliable signal over a meaningful span of output.
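The chunk-level idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the teacher's text continuations are hard-coded here (in practice they would presumably come from API calls), the helper names are hypothetical, and difflib's word-level `SequenceMatcher` ratio stands in for whatever continuous semantic similarity the method actually uses.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1] based on word-level sequence matching.
    ASSUMPTION: a real system would likely use embeddings or a learned scorer."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def score_chunk(chunk, teacher_samples):
    """Monte Carlo estimate: mean similarity to sampled teacher continuations."""
    return sum(similarity(chunk, t) for t in teacher_samples) / len(teacher_samples)

def preference_pair(student_chunks, teacher_samples):
    """Return (preferred, rejected) student chunks, ranked by Monte Carlo score."""
    ranked = sorted(student_chunks, key=lambda c: score_chunk(c, teacher_samples))
    return ranked[-1], ranked[0]

# Hypothetical rollouts for the same prefix ("Solve 2x + 3 = 11.").
teacher_samples = [
    "subtract 3 from both sides to get 2x = 8",
    "subtract 3 from each side so 2x = 8",
]
student_chunks = [
    "subtract 3 from both sides giving 2x = 8",
    "multiply both sides by 3 to get 6x = 33",
]

preferred, rejected = preference_pair(student_chunks, teacher_samples)
print("preferred:", preferred)
print("rejected: ", rejected)
```

Nothing in this loop touches teacher logits: the only thing required from the teacher is text, which is why the same pipeline could sit in front of a closed API.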

Three additional components make the process efficient and stable. A peak-entropy scheduler identifies the student's high-uncertainty decision points, the reasoning forks where supervision is most useful, and concentrates auditing there rather than distributing it uniformly across all tokens. A Dirichlet-Multinomial Bayesian prior bounds the variance introduced by discrete sampling, and a base-model KL anchor prevents policy collapse in the tokens the scheduler skips. The result is dense-enough supervision without requiring the teacher to expose any internal state.
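A rough sketch of how these pieces could fit together follows. All names, the local-maximum rule for picking peaks, the symmetric prior with alpha = 1, and the KL direction of the anchor are assumptions made for illustration; the summary above does not give those details. The Dirichlet-Multinomial posterior mean itself, (n_i + alpha) / (N + K * alpha), is the standard formula.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def peak_positions(entropies, top_n=2):
    """Indices of local entropy maxima, keeping the top_n highest (in position order)."""
    last = len(entropies) - 1
    peaks = [
        i for i in range(len(entropies))
        if (i == 0 or entropies[i] > entropies[i - 1])
        and (i == last or entropies[i] >= entropies[i + 1])
    ]
    peaks.sort(key=lambda i: entropies[i], reverse=True)
    return sorted(peaks[:top_n])

def dirichlet_posterior_mean(counts, alpha=1.0):
    """Posterior mean under a symmetric Dirichlet(alpha) prior and multinomial counts."""
    total = sum(counts) + alpha * len(counts)
    return [(n + alpha) / total for n in counts]

def kl(p, q, eps=1e-12):
    """KL(p || q) between two aligned probability lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def anchor_penalty(student_dists, base_dists, audited):
    """Sum of KL(student || base) over positions the scheduler did NOT audit."""
    skipped = [i for i in range(len(student_dists)) if i not in audited]
    return sum(kl(student_dists[i], base_dists[i]) for i in skipped)

# Toy per-position student distributions over a 4-token vocabulary (made-up numbers).
student_dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],   # uncertain: a reasoning fork
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.99, 0.005, 0.003, 0.002],
]
base_dists = [[0.25, 0.25, 0.25, 0.25] for _ in student_dists]

entropies = [entropy(d) for d in student_dists]
audited = peak_positions(entropies, top_n=2)

# Suppose 4 rollouts at one fork picked candidate chunks 0, 0, 0, 2 as winners.
posterior = dirichlet_posterior_mean([3, 0, 1], alpha=1.0)

print("audited positions:", audited)
print("smoothed preferences:", posterior)
print("anchor penalty on skipped positions:", anchor_penalty(student_dists, base_dists, audited))
```

The smoothing step is what keeps a handful of rollouts from producing a hard 0 or 1 preference: a candidate that never won still gets nonzero mass, so a single unlucky sample cannot zero it out.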

OmniOPD surpasses standard on-policy distillation by up to +28.64% on math benchmarks. When the teacher is upgraded from an open-weight model to a closed API, specifically Claude-4.5-Haiku or Gemini-2.5-Flash, the student gains an additional +9.54% relative over the open-weight baseline and moves past the performance ceiling of self-exploratory reinforcement learning. For teams distilling reasoning capability into smaller models, the takeaway is direct: proprietary API teachers are now viable, and in these experiments the distillation signal they provided was measurably stronger than what the open-weight teacher delivered.

We're thinking: The framing here matters more than it might appear at first. We've been watching on-policy distillation treated as a technique only available to labs with full model access, which effectively meant it was unavailable to most product teams. OmniOPD breaks that assumption cleanly, but the more interesting implication is what it says about the token-level logit signal itself: the +28.64% gain over standard OPD suggests chunk-level semantic verification isn't just a workaround for missing access; it may be a genuinely better supervision signal. If that holds across domains beyond math, the field may have been over-indexing on logit fidelity while the more reliable learning signal was sitting at a coarser granularity the whole time.

Key takeaways:

OmniOPD replaces token-level logit matching with Monte Carlo chunk verification scored by semantic similarity, removing the white-box teacher requirement entirely while concentrating supervision at high-entropy reasoning forks via a peak-entropy scheduler.
The method beats standard on-policy distillation by up to +28.64% on math benchmarks; pairing with closed-API teachers adds a further +9.54% relative gain, though results are currently reported on math-heavy benchmarks and generalization to other reasoning domains remains to be confirmed.
Teams distilling reasoning capability into smaller models should treat closed-API teachers as first-class options rather than fallbacks: OmniOPD makes the distillation pipeline viable against any model that can evaluate text output, and the stronger the teacher, the larger the measured gain.

Source: OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification