Also Worth Noting - 2026-05-14

Gradient variance fixes, flawed SWE-bench scores, smarter VLA training, code-based RAG, and forward-only adaptation

02 [Training] KL for a KL: On-Policy Distillation with Control Variate Baseline Single-sample Monte Carlo estimation is why on-policy distillation runs blow up unpredictably, and vOPD fixes this without adding a second forward pass. It reframes on-policy distillation as policy-gradient RL, then borrows the control variate baseline from that literature, deriving the value function in closed form so no extra model is needed. Gradient variance drops, training stabilizes, and the recipe becomes actually reproducible. Teams running OPD fine-tuning for reasoning models should treat this as a drop-in stabilizer before reaching for larger batch sizes or learning rate hacks. link
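
To make the control variate idea concrete, here is a minimal sketch on a single-token toy problem. It is not the paper's derivation: the student and teacher distributions are made-up values, the baseline is simply the closed-form reverse KL at that position, and all function names are hypothetical. It enumerates every token, so the mean and variance of the one-sample estimator are exact rather than sampled. Subtracting a baseline never biases the gradient; whether it lowers variance depends on the distributions, and in this toy it does.

```python
import math

def log_ratios(q, p):
    """Per-token log(q/p): the single-sample cost under reverse KL."""
    return [math.log(qi / pi) for qi, pi in zip(q, p)]

def reverse_kl(q, p):
    """KL(q || p), available in closed form from the two distributions."""
    return sum(qi * fi for qi, fi in zip(q, log_ratios(q, p)))

def exact_gradient(q, p):
    """Exact gradient of KL(q || p) w.r.t. the logits, with q = softmax(logits)."""
    kl = reverse_kl(q, p)
    return [qi * (fi - kl) for qi, fi in zip(q, log_ratios(q, p))]

def estimator_stats(q, p, baseline):
    """Mean and total variance of the one-sample score-function estimator,
    computed exactly by enumerating every token instead of sampling."""
    f = log_ratios(q, p)
    n = len(q)
    mean = [0.0] * n
    second_moment = 0.0
    for x in range(n):
        # Gradient of log q(x) w.r.t. the logits is onehot(x) - q.
        score = [(1.0 if i == x else 0.0) - q[i] for i in range(n)]
        g = [(f[x] - baseline) * s for s in score]
        for i in range(n):
            mean[i] += q[x] * g[i]
        second_moment += q[x] * sum(gi * gi for gi in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

# Toy distributions chosen for illustration only (not from the paper).
student = [0.3, 0.3, 0.2, 0.2]
teacher = [0.1, 0.2, 0.3, 0.4]

plain_mean, plain_var = estimator_stats(student, teacher, baseline=0.0)
cv_mean, cv_var = estimator_stats(
    student, teacher, baseline=reverse_kl(student, teacher)
)
```

Both estimators average to `exact_gradient(student, teacher)`; only their spread differs, which is the property a baseline is meant to exploit.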

03 [Eval] AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation Roughly 1 in 9 passing SWE-bench Verified trajectories is a fluke in which the patch satisfies the tests without fixing the underlying bug. Across 2,614 OpenHands trajectories from eight model backends, 10.7% of passing runs in the analyzed subset show this "Lucky Pass" pattern, meaning binary pass/fail leaderboards systematically overstate real agent capability. AgentLens introduces process-level reference trajectories to separate principled solutions from chaotic trial-and-error. Anyone using SWE-bench rankings to make model selection decisions should weight this finding heavily. link
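
To see why this matters for rankings, the sketch below discounts lucky passes from a raw resolve rate. Every number and model name in it is hypothetical and chosen only to illustrate the arithmetic; nothing here is per-model data from the paper, and `RunSummary`, `raw_rate`, and `adjusted_rate` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    model: str
    attempted: int  # tasks attempted
    passed: int     # tasks whose tests passed
    lucky: int      # passing runs judged not to fix the underlying bug

def raw_rate(run: RunSummary) -> float:
    """Leaderboard-style resolve rate: every passing run counts."""
    return run.passed / run.attempted

def adjusted_rate(run: RunSummary) -> float:
    """Resolve rate after discounting lucky passes."""
    return (run.passed - run.lucky) / run.attempted

# Hypothetical counts, for illustration only.
model_a = RunSummary("model-a", attempted=500, passed=300, lucky=45)
model_b = RunSummary("model-b", attempted=500, passed=290, lucky=15)

for run in (model_a, model_b):
    print(f"{run.model}: raw={raw_rate(run):.2f} adjusted={adjusted_rate(run):.2f}")
```

With these made-up counts, model-a leads on the raw rate but trails once lucky passes are removed, which is the kind of reversal a binary leaderboard cannot show.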

04 [Training] FrameSkip: Learning from Fewer but More Informative Frames in VLA Training Training VLA policies on every dense demonstration frame creates a supervision imbalance: long low-change segments dominate, while manipulation-critical transitions like contact and grasp appear only sparsely. FrameSkip scores each frame using action variation and visual-action coherence, then upweights the rare high-signal moments rather than treating all frames equally. Policy performance improves without collecting a single additional demonstration. For teams running teleoperation pipelines, this is a data-layer fix that costs nothing but a frame-selection pass before training begins. link
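
A rough sketch of the action-variation half of that scoring might look like the following. The formula (Euclidean distance between consecutive actions), the `floor` parameter, and the function names are assumptions for illustration; the paper's actual scoring, including its visual-action coherence term, is not reproduced here.

```python
import math

def action_variation_scores(actions):
    """Score each frame by how much the action changed from the previous frame.
    The first frame has no predecessor and gets a score of 0."""
    scores = [0.0]
    for prev, cur in zip(actions, actions[1:]):
        scores.append(math.dist(prev, cur))
    return scores

def sampling_weights(scores, floor=0.1):
    """Turn scores into sampling probabilities. The floor (a hypothetical
    knob) keeps low-change frames from being dropped entirely."""
    raw = [floor + s for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

# Toy 2-D action trace: a long idle stretch, then one abrupt change.
actions = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
scores = action_variation_scores(actions)
weights = sampling_weights(scores)
```

Here the frame at index 3, where the action jumps, receives most of the sampling mass, while the idle frames share the remainder.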

05 [RAG] Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation Free-form reasoning chains in multi-hop RAG let retrieval queries drift and errors compound invisibly, because the same model producing the chain also checks it. Representing multi-hop reasoning as executable programs instead makes every intermediate retrieval state explicit and independently verifiable, so failures localize to a specific step rather than propagating silently through prose. This structural shift cuts hallucination at intermediate hops and gives practitioners a concrete inspection point that natural-language chains cannot offer. If multi-hop question answering is in your pipeline, the diagnostic value alone justifies the switch. link
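
The core idea can be sketched in a few lines: each hop is a program step that names its output, and the executor records every query and result. This is a minimal illustration, not the paper's system; the `Step`, `HopFailure`, and `run_program` names, the toy dictionary retriever, and the placeholder entities are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Step:
    out: str    # variable name that stores this hop's result
    query: str  # query template; may reference earlier variables as {name}

class HopFailure(Exception):
    """Raised when a hop retrieves nothing, carrying the failing step index."""
    def __init__(self, index, query):
        super().__init__(f"step {index} retrieved nothing for {query!r}")
        self.index = index
        self.query = query

def run_program(steps, retrieve):
    """Execute hops in order, keeping every intermediate state inspectable."""
    state, trace = {}, []
    for i, step in enumerate(steps):
        query = step.query.format(**state)
        result = retrieve(query)
        trace.append((i, query, result))
        if result is None:
            raise HopFailure(i, query)
        state[step.out] = result
    return state, trace

# Toy corpus standing in for a real retriever (placeholder entities).
corpus = {
    "director of Film X": "Person Y",
    "birthplace of Person Y": "City Z",
}
program = [
    Step(out="director", query="director of Film X"),
    Step(out="birthplace", query="birthplace of {director}"),
]
state, trace = run_program(program, corpus.get)
```

If the second fact were missing from the corpus, `run_program` would raise `HopFailure` with `index == 1`, pointing at the exact hop that broke instead of returning a confident wrong answer.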

06 [Inference] FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation Adapting to labeled examples at inference time normally costs either backpropagation or O(n) context length growth. FAAST compiles labeled examples analytically into fast weights in a single forward pass, achieving constant-time inference overhead regardless of how many examples are provided. Across image classification and language modeling benchmarks it matches or beats backprop-based adaptation at a fraction of the compute. Teams that need to adapt to more than a few dozen labeled examples at test time will find the O(1) inference cost meaningfully different from in-context learning at scale. link
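
For intuition about why compiled fast weights give constant inference cost, here is a sketch using a classic linear associative memory (a sum of label-feature outer products). This is a stand-in technique, not FAAST's closed-form solution, and the feature vectors, labels, and function names are invented for illustration. The point it shows is structural: the weight matrix has a fixed shape no matter how many labeled examples are folded in.

```python
def compile_fast_weights(features, labels, num_classes):
    """Fold labeled examples into a fixed-size matrix W (num_classes x dim)
    by summing onehot(label) * feature outer products. No gradients needed."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    for x, y in zip(features, labels):
        for j in range(dim):
            W[y][j] += x[j]
    return W

def predict(W, x):
    """Classify with one matrix-vector product; the cost depends on the
    shape of W, not on how many examples were compiled into it."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy labeled examples (hypothetical 3-D features, two classes).
features = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0), (0.0, 0.8, 0.2)]
labels = [0, 0, 1, 1]

W = compile_fast_weights(features, labels, num_classes=2)
pred_a = predict(W, (0.8, 0.2, 0.0))
pred_b = predict(W, (0.1, 0.7, 0.2))
```

Compiling four examples or four thousand yields the same 2 x 3 matrix shape, whereas in-context learning would grow the prompt with every added example.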