Also Worth Noting - 2026-06-11
Five papers on cutting wasted compute: smarter MoE routers, leaner GUI agents, hidden model dependencies, RL rollout allocation, and subquadratic architecture selection.
02 [Training] Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Router rows in MoE models are rarely designed to actually encode their expert's behavior, so dot-product similarity scores are a poor proxy for true token-expert affinity. Manifold Power Iteration fixes this by aligning each router row with the principal singular direction of its associated expert matrix, reshaping the existing weights rather than adding new parameters. The alignment cost is low enough to apply during training without meaningful overhead. Teams running production MoE models with load-balance problems should treat this as a low-friction architectural patch before reaching for more expensive routing schemes.
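To make the alignment idea concrete, here is a minimal sketch of plain power iteration that pulls a router row toward the principal right singular direction of an expert matrix. It is an illustration under stated assumptions, not the paper's algorithm: the function name `align_router_row`, the choice to start from the existing row, and the choice to preserve its norm are all hypothetical, and the manifold constraint the paper's title refers to is not modeled.

```python
import math


def matvec(W, v):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rmatvec(W, u):
    """Compute W.T @ u for a matrix stored as a list of rows."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def align_router_row(router_row, expert_W, steps=50):
    """Hypothetical helper: rotate a router row toward the top right singular
    vector of expert_W, keeping the row's original norm (an assumption).

    Power iteration on W.T @ W converges to that vector as long as the
    starting row is nonzero and not orthogonal to it.
    """
    scale = norm(router_row)
    v = [x / scale for x in router_row]
    for _ in range(steps):
        v = rmatvec(expert_W, matvec(expert_W, v))  # one step of (W.T W) v
        n = norm(v)
        v = [x / n for x in v]
    return [scale * x for x in v]


# Toy expert whose dominant input direction is the first coordinate.
aligned = align_router_row([1.0, 1.0], [[3.0, 0.0], [0.0, 1.0]])
```

In the toy case the row ends up pointing along the first coordinate with its original length, which is the behavior the summary describes: existing weights are reshaped, and no parameters are added.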
03 [Inference] ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
Consecutive screenshots in GUI interaction trajectories are mostly identical, yet every frame is fully re-encoded, burning context budget on redundant visual tokens. ReVision trains multimodal language models on trajectories where redundant visual patches across frames are removed, so agents can fit substantially longer interaction histories under a fixed context window. Longer history access is what unlocks multi-step task improvement, a gain that vanilla computer-use agents have consistently failed to show. For teams building GUI automation pipelines, this is a direct path to better multi-step success without scaling the context window itself.
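The redundancy being removed can be pictured with a small sketch. It assumes each screenshot is already split into a grid of patches and that "redundant" means exactly equal to the patch at the same position in the previous frame. Both are simplifying assumptions for illustration; the source does not say which criterion ReVision uses, and `prune_redundant_patches` is a hypothetical name.

```python
def prune_redundant_patches(frames):
    """Keep every patch of the first frame, then only the patches that changed
    relative to the same grid position in the previous frame.

    frames: list of 2D grids (lists of rows) of comparable patch values.
    Returns a list of (frame_index, row, col, patch) tuples.
    """
    kept = []
    prev = None
    for t, frame in enumerate(frames):
        for r, row in enumerate(frame):
            for c, patch in enumerate(row):
                if prev is None or prev[r][c] != patch:
                    kept.append((t, r, c, patch))
        prev = frame
    return kept


# Three 2x2 "screenshots": only one patch changes, and only once.
frames = [
    [["a", "b"], ["c", "d"]],
    [["a", "b"], ["c", "x"]],
    [["a", "b"], ["c", "x"]],
]
kept = prune_redundant_patches(frames)
```

Here 12 patches shrink to 5, which is the kind of saving that lets a longer interaction history fit in the same context window.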
04 [Eval] Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
Most LLMs silently inherit choices from upstream models used to generate training data, filter corpora, or judge outputs, and those upstream models carry their own undocumented dependencies. ModSleuth reconstructs these dependency graphs by recursively tracing public artifacts, revealing that the full chains are rarely documented in any single release and are often circular. The recursive depth outpaces anything a practitioner can trace manually. Any team publishing a model or auditing one for compliance should run dependency reconstruction before claiming provenance, because the implicit graph is almost certainly deeper than the model card suggests.
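A recursive trace with cycle detection might look like the sketch below. It assumes the direct dependencies have already been extracted into a dictionary; how ModSleuth gathers them from public artifacts is not covered by the source. The function name and every model name in the example are hypothetical.

```python
def trace_dependencies(root, direct_deps):
    """Depth-first walk over a dependency mapping.

    direct_deps: dict mapping a model name to the models it directly relied on
    (for data generation, filtering, judging, and so on).
    Returns (set of all upstream models, list of edges that close a cycle).
    """
    upstream = set()
    back_edges = []
    on_path = []  # models on the current DFS path, used to spot cycles

    def visit(node):
        on_path.append(node)
        for dep in direct_deps.get(node, ()):
            if dep in on_path:
                back_edges.append((node, dep))  # circular dependency
            elif dep not in upstream:
                upstream.add(dep)
                visit(dep)
        on_path.pop()

    visit(root)
    return upstream, back_edges


# Hypothetical graph: the teacher behind the judge was itself built on model-a.
deps = {
    "model-a": ["judge-b", "filter-c"],
    "judge-b": ["teacher-d"],
    "filter-c": ["teacher-d"],
    "teacher-d": ["model-a"],
}
upstream, cycles = trace_dependencies("model-a", deps)
```

Even in this four-node example the model card for `model-a` would list two direct dependencies while the trace surfaces a third model and a loop back to the starting point.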
05 [Agent] TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
In agentic RLVR training, prompts that are too easy or too hard both produce near-zero reward variance, meaning most rollout compute yields no useful gradient signal. TRACE dynamically reallocates rollout budget away from low-contrast prompts at both the prompt level and the individual decision-step level within multi-turn rollouts, targeting compute where reward contrast is actually high. The result is measurably fewer wasted rollouts and better sample efficiency on agentic tasks where outcome-only rewards assign identical terminal scores to every step. Teams running RLVR on tool-use or multi-step reasoning tasks should consider step-level contrast as a first-class allocation signal, not just prompt-level filtering.
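The prompt-level half of the idea can be sketched as variance-proportional budgeting. This is an assumption-laden illustration, not TRACE's actual rule: the function name, the per-prompt floor, and the use of population variance over past rewards as the contrast signal are all hypothetical, and the step-level allocation is not shown.

```python
from statistics import pvariance


def allocate_rollouts(reward_history, total_budget, floor=1):
    """Split a rollout budget across prompts in proportion to the variance of
    their past rewards, after giving every prompt a small floor.

    reward_history: dict mapping prompt id to a non-empty list of past rewards.
    """
    prompts = list(reward_history)
    spare = total_budget - floor * len(prompts)
    if spare < 0:
        raise ValueError("budget too small for the per-prompt floor")

    variances = {p: pvariance(reward_history[p]) for p in prompts}
    total_var = sum(variances.values())
    alloc = {p: floor for p in prompts}
    if total_var == 0:
        return alloc  # no contrast anywhere; leave the spare budget unspent

    for p in prompts:
        alloc[p] += int(spare * variances[p] / total_var)

    # Hand rounding leftovers to the highest-variance prompts first.
    leftover = total_budget - sum(alloc.values())
    for p in sorted(prompts, key=variances.get, reverse=True)[:leftover]:
        alloc[p] += 1
    return alloc


history = {
    "easy": [1, 1, 1, 1],   # always solved: zero variance
    "hard": [0, 0, 0, 0],   # never solved: zero variance
    "mixed": [1, 0, 1, 0],  # the only prompt with reward contrast
}
plan = allocate_rollouts(history, total_budget=12)
```

With this toy history, the always-solved and never-solved prompts keep only their floor and the remaining rollouts go to the one prompt whose outcomes actually differ.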
06 [Theory] On Subquadratic Architectures: From Applications to Principles
Across code pre-training, knowledge distillation from large LLMs, and time-series foundation modeling, xLSTM does not consistently win against its subquadratic rivals. Gated DeltaNet, which combines selective state updates with a delta-rule memory write mechanism, delivers the strongest performance on tasks with complex long-range dependencies. xLSTM leads in some settings but not the ones where dependency structure is most demanding. Practitioners evaluating alternatives to transformer attention for code or structured sequence tasks should treat Gated DeltaNet as the current benchmark to beat, not Mamba-2.
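For readers unfamiliar with the delta rule, the sketch below shows one gated delta-rule state update in the form it is commonly written, S_t = alpha * S_{t-1} (I - beta * k k^T) + beta * v k^T. It is a single-head, pure-Python illustration and is not taken from the paper summarized here: the function name is hypothetical, and real implementations learn alpha and beta per token and use chunked parallel kernels.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent memory update.

    S: d_v x d_k state matrix as a list of rows.
    k, v: key and value vectors (k is assumed to be unit-norm).
    alpha: decay gate in [0, 1]; beta: write strength in [0, 1].
    Equivalent to: decay the state, read the value currently stored under k,
    then write back the error between v and that prediction.
    """
    decayed = [[alpha * s for s in row] for row in S]
    pred = [sum(s * kj for s, kj in zip(row, k)) for row in decayed]
    return [
        [s + beta * (vi - pi) * kj for s, kj in zip(row, k)]
        for row, vi, pi in zip(decayed, v, pred)
    ]


# Write a value under a key, then overwrite it under the same key.
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[2.0, 3.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, k=[1.0, 0.0], v=[5.0, 7.0], alpha=1.0, beta=1.0)
```

The second write replaces the stored value instead of adding to it, which is the property that distinguishes delta-rule memory from a purely additive linear-attention state.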