Dense Teacher Supervision Breaks Multi-Turn Agents. SDAR Fixes It.

SDAR adds a sigmoid-gated distillation layer on top of RL, lifting agent performance by up to 10.2 percentage points over GRPO across ALFWorld, WebShop, and Search-QA.

Adding richer supervision to RL training sounds like a straightforward win. When that supervision comes from a teacher branch that sees context the deployed agent never will, the signal corrupts rather than guides.

On-Policy Self-Distillation (OPSD) was designed to solve a real problem: trajectory-level RL rewards are coarse. A single scalar at the end of a 20-step interaction tells the model almost nothing about which token choices were good and which were wasteful. OPSD addresses this by running a privileged teacher branch alongside the agent, generating dense token-level guidance that the student can imitate at every step. In single-turn or short-horizon settings, this works. The teacher sees enough of the same context that its token preferences are meaningful.
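To make the coarseness concrete, here is a minimal sketch of how a group-normalized, GRPO-style advantage is spread over a trajectory. The function name `grpo_token_advantages` and the toy rewards are hypothetical, chosen only for illustration; the point is that every token in a trajectory receives the same scalar, which is the gap dense token-level guidance is meant to fill.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-8):
    """Broadcast one group-normalized advantage per trajectory to all its tokens.

    rewards: one scalar reward per sampled trajectory in the group.
    lengths: number of generated tokens in each trajectory.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Normalize each trajectory's reward against the group mean and std.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token in a trajectory gets the identical advantage value.
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group: one successful and one failed rollout (illustrative values).
per_token = grpo_token_advantages(rewards=[1.0, 0.0], lengths=[3, 2])
```

In this toy group, all three tokens of the successful rollout share one positive advantage and both tokens of the failed rollout share one negative advantage, regardless of which individual token choices helped or hurt.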

Multi-turn agents break that assumption. Over a long interaction, the agent accumulates context that the teacher branch, operating with privileged but asymmetric information, does not fully share. The teacher's token-level endorsements begin to reflect a different problem than the one the agent is actually solving. Errors compound across turns. Worse, skill retrieval failures inside the teacher produce negative rejection signals that carry no useful gradient information, yet naive GRPO+OPSD treats them symmetrically with genuine positive guidance. The result is training instability, not acceleration.

SDAR (Self-Distilled Agentic Reinforcement Learning) keeps RL as the primary optimization backbone and treats OPSD as a conditional auxiliary rather than a co-equal objective. The mechanism is a sigmoid gate applied to detached token-level distillation signals. For tokens where the teacher endorses the agent's direction (positive-gap tokens), the gate opens and distillation loss flows through at full strength. For tokens where the teacher rejects the agent's choice, the gate attenuates the signal softly rather than zeroing it or passing it unchanged. The design choice matters: hard zeroing discards potentially recoverable signal, while unconstrained negative flow destabilizes training. The sigmoid creates a smooth interpolation that scales with confidence in the teacher's judgment.

Think of it as a credibility filter on a co-pilot. When the co-pilot's read of the situation aligns with what the pilot is seeing, their input carries full weight. When the co-pilot is working from a different instrument panel, their corrections are acknowledged but discounted before reaching the controls.
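The gating idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the gap is taken here to be the teacher's log-probability minus the student's on each sampled token, and the names `gate_weights`, `gated_distill_loss`, `total_loss`, `temperature`, and `aux_coef` are all hypothetical. In an autodiff framework the gate would be computed from detached values so that it acts as a fixed weight rather than a gradient path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_weights(teacher_logp, student_logp, temperature=1.0):
    """Per-token gate in (0, 1) computed from the teacher-student gap.

    Assumption: gap = teacher log-prob minus student log-prob on the
    sampled token. Positive gap pushes the gate toward 1; negative gap
    shrinks it smoothly toward 0 instead of cutting it off.
    """
    gap = np.asarray(teacher_logp, dtype=float) - np.asarray(student_logp, dtype=float)
    # Treated as a constant weight (the "detached" signal in the text).
    return sigmoid(gap / temperature)

def gated_distill_loss(per_token_loss, weights):
    """Mean of per-token distillation losses, scaled by the gate."""
    per_token_loss = np.asarray(per_token_loss, dtype=float)
    return float(np.mean(weights * per_token_loss))

def total_loss(rl_loss, per_token_loss, weights, aux_coef=0.1):
    """RL stays the primary term; gated distillation is an auxiliary."""
    return rl_loss + aux_coef * gated_distill_loss(per_token_loss, weights)

# Token 0: teacher agrees more strongly than the student (positive gap).
# Token 1: teacher rejects the student's choice (negative gap).
w = gate_weights(teacher_logp=[-0.1, -3.0], student_logp=[-2.0, -0.5], temperature=0.5)
loss = total_loss(rl_loss=1.0, per_token_loss=[0.8, 1.2], weights=w)
```

With these toy inputs the first token's weight lands close to 1 and the second's close to 0 without reaching it, which is the soft attenuation described above. A smaller `temperature` makes the gate behave more like a hard switch; a larger one flattens it toward uniform weighting.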

Across the Qwen2.5 and Qwen3 model families tested on ALFWorld, WebShop, and Search-QA, SDAR beats GRPO by 9.4 percentage points on ALFWorld, 10.2 points on WebShop accuracy, and 7.0 points on Search-QA. Those gains hold across model scales, and SDAR consistently outperforms hybrid RL-OPSD baselines that apply distillation without gating. The instability seen in naive GRPO+OPSD does not appear in SDAR. For teams training LLM agents on long-horizon interactive tasks, the takeaway is direct: the failure mode here is not RL itself but unfiltered dense supervision from a teacher operating under asymmetric context, and gating that supervision by token-level confidence resolves it without abandoning the distillation benefit.

We're thinking: The deeper issue SDAR surfaces is one that will recur everywhere OPSD-style methods get applied to agentic settings: the teacher branch's privileged context is a training-time artifact that has no equivalent at deployment. We find this worth watching not just as a training trick but as a structural warning. Any dense supervision scheme that relies on a teacher seeing more than the agent will see in production is implicitly training the agent to imitate a capability it cannot reproduce. SDAR's sigmoid gate is a pragmatic fix, but the broader design principle, treating privileged-context distillation as a gated auxiliary rather than a direct objective, should probably become the default assumption for anyone building multi-turn agent training pipelines.

Key takeaways:

SDAR maps OPSD's token-level distillation signals through a sigmoid gate, strengthening positive-gap tokens and softly attenuating negative teacher rejections, keeping RL as the primary objective rather than sharing optimization weight with an unstable auxiliary.
SDAR beats GRPO by 9.4 percentage points on ALFWorld, 10.2 points on WebShop, and 7.0 points on Search-QA across Qwen2.5 and Qwen3 scales; the caveat is that all three benchmarks are relatively structured environments, and the gating behavior on noisier or open-domain tasks remains untested.
Teams training RL agents on multi-turn tasks should audit whether their dense supervision source has asymmetric context access; if it does, gating distillation by token-level confidence rather than applying it uniformly is the direct fix SDAR demonstrates.

Source: Self-Distilled Agentic Reinforcement Learning