Also Worth Noting - 2026-05-12

Five papers on compressing, querying, and managing LLM infrastructure more efficiently than current defaults assume

Also Worth Noting

02 [Training] SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training Pruning beats random initialization for MoE pretraining, but distillation gains shrink as scale grows, meaning the two techniques are not interchangeable. The finding inverts a common assumption that distillation compounds cleanly on top of structural pruning at any scale. Expert compression choices made early in pretraining propagate through continued training in ways that dominate final model quality. Teams compressing MoE models should sequence pruning before distillation and treat expert count decisions as load-bearing, not cosmetic. link

03 [Eval] Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs A 30-token user prompt asking for a comparison table cuts sponsored-flight selection rates to near zero across 12 tested models, including GPT-4o, with no fine-tuning required. The mechanism is structural: tabular output forces the model to surface all options symmetrically rather than collapse to a single recommendation that a soft sponsorship cue in the system prompt can steer. The fix works across both open-weight and proprietary models under three different judges. Any team auditing LLM-powered commerce surfaces should add this prompt pattern to their evaluation baseline immediately. link

04 [RAG] Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient? BM25 paired with a strong frontier LLM in an agentic loop matches or beats dense-retrieval baselines on deep-research tasks, provided retrieval depth is sufficient. The key variable is not retrieval modality but the reasoning and tool-use quality of the LLM coordinating the loop. Pi-Serini, the released search agent, equips BM25 with retrieve, browse, and read tools to cover the gaps lexical matching leaves. Teams building agentic pipelines may be over-investing in embedding infrastructure when a well-tuned lexical retriever with adequate recall depth delivers equivalent downstream quality. link

05 [Inference] Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction Evicting low-relevance KV entries in long contexts improves answer quality over full-cache baselines, not just throughput, because irrelevant tokens actively dilute attention away from useful evidence at sequence lengths above roughly 32k. Full-cache attention is not a quality ceiling; it is a source of noise at scale. A global retention-based eviction method learns each token's future utility and removes entries that would otherwise pull attention mass away from high-signal positions. For teams serving long-context workloads, selective KV eviction is worth evaluating as a quality intervention, not only a memory optimization. link

06 [Agent] Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning Static skill accumulation in LLM agents causes interference and performance drift over long task horizons, not just library bloat. The assumption that skills should either persist indefinitely or be fully internalized ignores the uneven marginal contribution of individual skills across tasks and stages. SLIM, the proposed lifecycle policy, selectively retires or internalizes skills based on task context, keeping the active skill set non-monotonic and bounded. Teams running multi-step agentic RL should treat skill eviction as a first-class design decision alongside skill acquisition. link