Also Worth Noting - 2026-06-17

Five papers on training signals, inference architecture, and efficiency trade-offs across diffusion LLMs, hybrid attention, and coding agents

02 [Training] Learning from the Self-future: On-policy Self-distillation for dLLMs
On-policy self-distillation (OPSD) breaks when applied to diffusion LLMs (dLLMs) because its left-to-right prefix conditioning conflicts directly with bidirectional masked generation. d-OPSD fixes this by redesigning both the self-teacher construction and the divergence supervision signal to match arbitrary-order generation, rather than patching the autoregressive formulation. This is the first OPSD framework built for dLLMs, opening a post-training path that this model class previously lacked. Teams experimenting with diffusion LLMs for reasoning or instruction-following should track this closely.
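
To make the conflict concrete, here is a minimal sketch of the two conditioning views. It does not reproduce d-OPSD itself; the function names and the `[MASK]` placeholder are assumptions chosen for illustration.

```python
MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper

def prefix_context(tokens, t):
    """Autoregressive view: position t sees only the tokens to its left."""
    return tokens[:t]

def masked_context(tokens, masked_positions):
    """Masked-diffusion view: every unmasked token is visible, on both sides."""
    return [MASK if i in masked_positions else tok
            for i, tok in enumerate(tokens)]

tokens = ["the", "cat", "sat", "on", "mats"]
ar_view = prefix_context(tokens, 2)            # left context only
dllm_view = masked_context(tokens, {1, 3})     # context on both sides of each gap
```

A teacher built from `prefix_context` never sees tokens to the right of the position it scores, while the student in `masked_context` does, which is the mismatch the entry describes.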

03 [Inference] The Price of Anarchy in Disaggregated Inference
When prefill and decode GPU pools behave as competing agents in a resource game, the system loses measurable efficiency, and this paper is the first to quantify that loss formally as a price of anarchy. Three coupled games model the architecture: a two-player resource allocation game, a selfish KV cache game, and a request-routing congestion game with positive externalities, all validated against NVIDIA Dynamo. Selfish allocation by either pool degrades total system throughput even when per-pool utilization looks healthy. Infrastructure teams now have a formal argument for centralized scheduling over pool-local optimization.
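
For readers new to the term, the price of anarchy is the ratio of the best achievable total welfare to the welfare of the worst equilibrium. The toy below is a sketch under stated assumptions: the payoff numbers, the `share`/`hoard` actions, and the function names are invented for illustration and are not the paper's games.

```python
from itertools import product

ACTIONS = ("share", "hoard")

# Hypothetical throughput payoffs: (prefill_pool, decode_pool) per action pair.
PAYOFFS = {
    ("share", "share"): (3, 3),
    ("share", "hoard"): (0, 4),
    ("hoard", "share"): (4, 0),
    ("hoard", "hoard"): (1, 1),
}

def pure_nash_equilibria(payoffs, actions):
    """Return action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        u_a, u_b = payoffs[(a, b)]
        a_stays = all(payoffs[(alt, b)][0] <= u_a for alt in actions)
        b_stays = all(payoffs[(a, alt)][1] <= u_b for alt in actions)
        if a_stays and b_stays:
            equilibria.append((a, b))
    return equilibria

def price_of_anarchy(payoffs, actions):
    """Optimal total welfare divided by the worst equilibrium's welfare."""
    welfare = {pair: sum(u) for pair, u in payoffs.items()}
    equilibria = pure_nash_equilibria(payoffs, actions)
    if not equilibria:
        raise ValueError("no pure Nash equilibrium in this game")
    worst = min(welfare[pair] for pair in equilibria)
    return max(welfare.values()) / worst

poa = price_of_anarchy(PAYOFFS, ACTIONS)
```

In this toy, each pool does individually better by hoarding, so the only equilibrium is mutual hoarding with total throughput 2 against an achievable 6, giving a price of anarchy of 3.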

04 [Theory] A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization
GRPO collapse is not random: a token-level gradient taxonomy shows that instability is predictable from the joint signal of advantage sign and the sharpness of the current token distribution. When that combination is unfavorable, updates push entropy in the wrong direction and collapse follows. Winner Advantage Policy Optimization (WAPO) addresses this by restricting gradient updates to positive-advantage completions, using a clipped objective. Because the failure mode is now detectable before collapse, teams running RLVR fine-tuning can instrument for it rather than diagnosing after the fact.
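
The idea of updating only on positive-advantage samples can be sketched as follows. This assumes a PPO-style clipped ratio; the paper's exact objective may differ, and the function name, signature, and `clip_eps` default are hypothetical.

```python
import math

def winner_only_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate loss over positive-advantage samples only.

    Assumption: PPO-style clipping on the probability ratio. Samples with
    advantage <= 0 are skipped, so they contribute no gradient.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        if adv <= 0:
            continue  # non-winners are excluded from the update
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic choice between the raw and clipped surrogate terms.
        terms.append(min(ratio * adv, clipped * adv))
    if not terms:
        return 0.0  # no winners in the batch: nothing to optimize
    return -sum(terms) / len(terms)

loss = winner_only_clipped_loss(
    logp_new=[0.0, 0.0, 0.0],
    logp_old=[0.0, 0.0, 0.0],
    advantages=[1.0, -1.0, 2.0],
)
```

With unchanged log-probabilities the ratio is 1, so only the two positive advantages (1.0 and 2.0) enter the mean and the negative-advantage sample is ignored.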

05 [Agent] FastContext: Training Efficient Repository Explorer for Coding Agents
A small, dedicated exploration subagent cuts the tokens consumed before the solver model ever reads a relevant file, replacing the solver's own repository search. FastContext issues parallel tool calls and returns concise file summaries on demand, keeping exploratory reads out of the solver's context history entirely. The architectural split means the solver's context stays clean from the first token. For coding agent pipelines currently bottlenecked on context pollution from grep-and-read loops, this is a concrete structural change worth evaluating.
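
The explorer/solver split can be sketched with concurrent reads that return only short summaries. This is an illustration under assumptions, not FastContext's implementation: `FAKE_REPO`, `fake_read_file`, `summarize_file`, and `explore` are hypothetical names, and the "summary" here is just the first line of each file.

```python
import asyncio

# Hypothetical in-memory repository standing in for real tool calls.
FAKE_REPO = {
    "src/app.py": "# Entry point for the service\nimport db\n",
    "src/db.py": "# Database helpers\n",
}

async def fake_read_file(path):
    """Stand-in for an asynchronous file-read tool."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return FAKE_REPO[path]

async def summarize_file(path, read_file, max_chars=200):
    """Read one file and keep only a short summary of it."""
    text = await read_file(path)
    lines = text.strip().splitlines()
    first_line = lines[0] if lines else ""
    return {"path": path, "summary": first_line[:max_chars]}

async def explore(paths, read_file):
    """Issue all reads concurrently; results come back in input order."""
    return await asyncio.gather(
        *(summarize_file(p, read_file) for p in paths)
    )

# The solver would receive only these summaries, never the raw file text.
summaries = asyncio.run(explore(["src/app.py", "src/db.py"], fake_read_file))
```

The design point is that the full file contents live only inside the explorer's coroutines, so the solver's context grows by one short record per file rather than by every exploratory read.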

06 [Eval] Rethinking the Role of Efficient Attention in Hybrid Architectures
Sliding-window attention (SWA) degrades long-range dependency tasks even when full-attention layers are present elsewhere in the same stack, so SWA is not a free efficiency win in hybrid models. A systematic study spanning scaling behavior, mechanism analysis, and architecture design finds that the choice of efficient attention primarily affects how fast long-context capability emerges during training, not whether it emerges at full scale. Recurrent mixers show a different failure profile, concentrated on tasks requiring precise token recall rather than range. Hybrid architecture designers should run task-specific long-context evals before treating SWA slots as interchangeable with recurrent alternatives.
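
As a reminder of what an SWA layer gives up, the sketch below builds a causal attention mask with and without a window. The function name and the `window` parameter are assumptions for illustration; real implementations build this mask as a tensor inside the attention kernel.

```python
def causal_mask(seq_len, window=None):
    """Return mask[i][j], True when query position i may attend to key j.

    window=None gives full causal attention; an integer window restricts
    each query to itself and the window - 1 positions before it.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: no attending to future positions
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask

full = causal_mask(6)
swa = causal_mask(6, window=3)
```

In the windowed mask, the last position can no longer attend directly to the first, so any dependency on that token has to be carried by other layers in the stack.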