Also Worth Noting - 2026-06-03

Five papers on making models faster, smarter, and more durable: from optimizer geometry to robot distillation to memory consolidation.

02 [Theory] World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
Visual rollouts from world models look plausible far more often than they are task-correct, and treating them as ground truth sends planning in the wrong direction. The fix is to score each rollout with a multimodal LLM (MLLM) before letting it influence the final answer, so that visual simulation serves as a hypothesis to be vetted rather than a verdict. Neither model needs retraining. For teams building planning pipelines, this filtering step is a low-cost way to get the spatial grounding of world models without inheriting their stochastic failure modes. link
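
The vet-before-trust step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the Rollout type, the score_fn callback (standing in for an MLLM judge that returns a score between 0 and 1), and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rollout:
    """One candidate plan plus the frames a world model predicted for it."""
    plan: str
    frames: list = field(default_factory=list)  # opaque in this sketch


def select_plan(
    rollouts: List[Rollout],
    score_fn: Callable[[Rollout], float],  # hypothetical MLLM judge, 0..1
    threshold: float = 0.5,                # assumed cutoff, not from the paper
) -> Optional[str]:
    """Return the plan of the best-scoring rollout, or None if none passes."""
    scored = [(score_fn(r), r) for r in rollouts]
    passing = [(s, r) for s, r in scored if s >= threshold]
    if not passing:
        return None  # no rollout is trusted; the caller falls back
    _, best = max(passing, key=lambda pair: pair[0])
    return best.plan
```

Returning None when nothing clears the threshold keeps an implausible rollout from steering the plan by default, which is the failure the blurb warns about.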

03 [Training] Why Muon Outperforms Adam: A Curvature Perspective
Muon trains large language models roughly twice as efficiently as Adam, and the geometric reason is now pinned down: Muon incurs a smaller second-order curvature penalty per step at matched validation loss. Both optimizers produce comparable first-order gains, but Muon's update direction aligns better with the local loss landscape, so each step loses less ground to curvature. Practitioners who have been waiting for a principled reason to switch optimizers rather than an empirical nudge now have one. link
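
The first-order gain versus curvature penalty split is the standard second-order Taylor estimate of a step's loss change: gradient dotted with the update, plus half the update's quadratic form under the Hessian. The sketch below computes both terms on a toy quadratic loss. It implements neither Muon nor Adam, and the function name, the toy Hessian, and the learning rate are assumptions chosen for illustration; the paper's exact measurement may differ.

```python
def decompose_step(grad, hessian, delta):
    """Split the quadratic estimate of a step's loss change into two terms.

    Returns (first_order, curvature_penalty) where
      first_order       = grad . delta           (negative means progress)
      curvature_penalty = 0.5 * delta^T H delta  (positive means ground lost)
    """
    first_order = sum(g * d for g, d in zip(grad, delta))
    h_delta = [sum(h * d for h, d in zip(row, delta)) for row in hessian]
    curvature_penalty = 0.5 * sum(d * hd for d, hd in zip(delta, h_delta))
    return first_order, curvature_penalty


# Toy loss L(theta) = 0.5 * theta^T H theta with one steep direction.
hessian = [[1.0, 0.0], [0.0, 100.0]]
theta = [1.0, 1.0]
grad = [1.0, 100.0]                # H applied to theta
delta = [-0.01 * g for g in grad]  # plain gradient step, learning rate 0.01

first_order, curvature_penalty = decompose_step(grad, hessian, delta)
predicted_change = first_order + curvature_penalty
```

On this toy problem the steep direction dominates the penalty, so two updates with similar first-order gain can end up with very different net progress depending on how much of the step lands on high-curvature directions.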

04 [Inference] SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
Sparse attention does not actually solve the KV cache problem, and its selection step stays O(T²) in the sequence length T even when the attention itself is sparse. SparDA breaks both bottlenecks by adding a fourth per-layer projection called the Forecast, which predicts which KV blocks the next layer will need, so selection and computation run in parallel rather than in sequence. This lookahead design is where the O(T²) selection cost finally collapses. Teams serving sequences above 32k tokens should watch this closely. link
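
To make the lookahead idea concrete, here is a rough sketch of picking KV blocks for the next layer from a forecast vector. The blurb does not say how the Forecast is scored against blocks, so the dot product against one summary vector per block, the top-k rule, and every name below are assumptions for illustration rather than SparDA's actual design.

```python
def forecast_select(forecast, block_summaries, k):
    """Pick the k KV blocks whose summaries best match a forecast vector.

    forecast        : list of floats produced at the current layer
                      (hypothetical stand-in for the Forecast projection)
    block_summaries : one list of floats per KV block of the next layer
                      (assumed here; for example a mean of the block's keys)
    Returns the chosen block indices in ascending order.
    """
    scores = [
        sum(f * s for f, s in zip(forecast, summary))
        for summary in block_summaries
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Four blocks, keep two. The choice is available before the next layer
# starts, which is the property that lets selection overlap with compute.
chosen = forecast_select(
    forecast=[1.0, 0.0],
    block_summaries=[[0.1, 5.0], [0.9, 0.0], [0.5, 0.0], [-1.0, 2.0]],
    k=2,
)
```

Scoring one summary per block instead of every cached token is what keeps selection cost tied to the number of blocks in this sketch.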

05 [Application] Flash-WAM: Modality-Aware Distillation for World Action Models
Standard step distillation collapses action accuracy in joint video-action models even when video quality metrics stay high, because video and action streams operate on different noise schedules with different marginal distributions. Treating them as a single modality during distillation means the action head absorbs the wrong gradient signal while video output looks fine. Flash-WAM applies separate distillation schedules per modality, recovering real-time control speeds without the silent action degradation that video-only evaluation would miss entirely. link
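
As a small illustration of "separate schedules per modality", the sketch below builds an independent list of noise levels for each stream instead of one shared list. The evenly spaced levels, the step counts, and the function names are assumptions made for the example; the blurb does not give Flash-WAM's actual schedules.

```python
def timesteps(num_steps):
    """Evenly spaced noise levels from 1.0 (pure noise) down to 0.0 (clean)."""
    return [1.0 - i / num_steps for i in range(num_steps + 1)]


def per_modality_schedules(step_counts):
    """Build one independent schedule per modality name."""
    return {name: timesteps(n) for name, n in step_counts.items()}


# Hypothetical student step budgets: the two streams need not match.
schedules = per_modality_schedules({"video": 4, "action": 2})
```

Keeping the schedules in a per-modality mapping also makes it natural to evaluate each stream separately, so an action regression cannot hide behind a healthy video metric.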

06 [Agent] Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
Standard fine-tuning on new temporal information degrades prior parametric knowledge at a rate that makes continual learning impractical past a few hundred updates. The Sleep paradigm addresses this by running periodic consolidation cycles that distill short-term in-context knowledge into long-term parameters while replaying prior distributions to prevent overwriting. Consolidation cuts that degradation significantly without requiring a full retraining pass. For teams building agents that must track a changing world over weeks or months, this is a more tractable path than context stuffing. link
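
The replay half of a consolidation cycle can be sketched as a batch builder that mixes newly acquired examples with samples drawn from earlier data. This is a generic replay sketch under stated assumptions, not the paper's procedure: the function name, the replay_fraction parameter, and the uniform sampling are all hypothetical.

```python
import random


def consolidation_batch(new_items, replay_buffer, batch_size,
                        replay_fraction=0.5, rng=None):
    """Mix new examples with replayed old ones for one consolidation update.

    replay_fraction is the assumed share of the batch drawn from prior data;
    both counts are capped by how many items are actually available.
    """
    rng = rng or random.Random()
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    n_new = min(batch_size - n_replay, len(new_items))
    batch = rng.sample(new_items, n_new) + rng.sample(replay_buffer, n_replay)
    rng.shuffle(batch)  # avoid a fixed new-then-old ordering within the batch
    return batch


new_facts = [f"new-{i}" for i in range(10)]
old_facts = [f"old-{i}" for i in range(10)]
batch = consolidation_batch(new_facts, old_facts, batch_size=8,
                            replay_fraction=0.25, rng=random.Random(0))
```

Raising replay_fraction trades speed of absorbing new information for protection of old knowledge, which is the tension the blurb describes.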