Also Worth Noting - 2026-05-10
Tool-gating waste, dedup ROI gaps, cheaper evolutionary search, and two distillation fixes for reasoning quality
02 [Agent] LLM Agents Already Know When to Call Tools -- Even Without Reasoning
Agents call tools reflexively even when the base model already has the answer, and until now no benchmark had measured the cost of that reflex. When2Tool spans 18 environments across three necessity categories -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that draw a clear line between tool-necessary and tool-unnecessary tasks. Evaluations show agents overtrigger tool calls even on tasks well within their parametric knowledge. Teams tuning tool-gating policies now have a concrete benchmark for quantifying and cutting unnecessary API spend.
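For teams that want to track the same failure mode on their own traces, a minimal sketch of an overtrigger metric follows. The `Episode` record and its two fields are hypothetical names chosen for illustration; they are not When2Tool's schema, and the benchmark's own scoring may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Episode:
    """One evaluated task (hypothetical record, not the benchmark's format)."""
    tool_necessary: bool  # ground-truth label: does the task need a tool?
    tool_called: bool     # did the agent issue at least one tool call?


def trigger_rates(episodes: Iterable[Episode]) -> Dict[str, float]:
    """Return overtrigger and undertrigger rates for a batch of episodes."""
    episodes = list(episodes)
    unnecessary = [e for e in episodes if not e.tool_necessary]
    necessary = [e for e in episodes if e.tool_necessary]
    # Overtrigger: a tool was called on a task that did not need one.
    over = sum(e.tool_called for e in unnecessary) / len(unnecessary) if unnecessary else 0.0
    # Undertrigger: no tool was called on a task that did need one.
    under = sum(not e.tool_called for e in necessary) / len(necessary) if necessary else 0.0
    return {"overtrigger": over, "undertrigger": under}
```

Reporting both rates together matters: a gating policy that simply suppresses all tool calls would drive the overtrigger rate to zero while failing every tool-necessary task.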
03 [RAG] Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
Byte-exact chunk deduplication cuts RAG context by only 0.16% on clean academic corpora but by 80.34% in multi-turn conversational pipelines -- a roughly 500x gap driven entirely by deployment type, not technique. The three-regime analysis covers 22.2M BeIR passages (near-zero reduction), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction), validated by a five-judge cross-vendor panel across four production APIs. Quality is preserved across all three regimes. The ROI calculation for adding dedup to a RAG pipeline should start with deployment classification, not benchmark numbers.
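To estimate which regime a given pipeline falls into, a rough sketch of byte-exact deduplication with a reduction measurement is shown below. The SHA-256 digest and the byte-based reduction ratio are assumptions made for illustration; the paper's own pipeline and its unit of measurement (bytes versus tokens) are not specified in this summary.

```python
import hashlib
from typing import List


def dedup_chunks(chunks: List[str]) -> List[str]:
    """Keep the first occurrence of each byte-identical chunk, preserving order."""
    seen = set()
    kept = []
    for chunk in chunks:
        # Hash the exact UTF-8 bytes: any difference, even whitespace,
        # produces a different digest, so only exact copies are dropped.
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def byte_reduction(original: List[str], kept: List[str]) -> float:
    """Fraction of context bytes removed by deduplication (0.0 to 1.0)."""
    before = sum(len(c.encode("utf-8")) for c in original)
    after = sum(len(c.encode("utf-8")) for c in kept)
    return 0.0 if before == 0 else 1.0 - after / before
```

Running `byte_reduction` over a sample of real assembled contexts gives a quick read on whether a deployment looks more like the clean-corpus regime or the multi-turn one, before any engineering time is committed.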
04 [Training] LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search
AlphaEvolve-style evolutionary search overspends on frontier mutation models not because the tasks require it, but because archives fail to preserve solution diversity, so teams compensate with model strength. LEVI fixes this at the framework level: a diversity-aware archive, a routing layer that sends local edits to smaller models and reserves frontier calls for genuine novelty, and selective evaluation that skips redundant rollouts. With these changes, smaller models match frontier search quality at a fraction of the cost. Teams running LLM-guided program synthesis or algorithmic discovery should audit archive diversity before scaling model size.
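One generic way to keep an archive diverse is a MAP-Elites-style grid that stores the best candidate per behavior cell instead of a single global top-k list. The sketch below illustrates that general idea only; the summary does not say LEVI's archive works this way, and `CellArchive` and the descriptor tuples are hypothetical.

```python
from typing import Dict, Hashable, Optional, Tuple


class CellArchive:
    """Keeps the best-scoring program per behavior cell (MAP-Elites style)."""

    def __init__(self) -> None:
        self._cells: Dict[Hashable, Tuple[float, str]] = {}

    def add(self, descriptor: Hashable, score: float, program: str) -> bool:
        """Insert if the cell is empty or the new score beats the incumbent."""
        current = self._cells.get(descriptor)
        if current is None or score > current[0]:
            self._cells[descriptor] = (score, program)
            return True
        return False

    def occupied_cells(self) -> int:
        """Number of distinct behavior cells filled: a simple diversity signal."""
        return len(self._cells)

    def best(self) -> Optional[Tuple[float, str]]:
        """Highest-scoring (score, program) pair across all cells, if any."""
        if not self._cells:
            return None
        return max(self._cells.values(), key=lambda entry: entry[0])
```

Tracking `occupied_cells()` over the course of a run is one cheap audit: if the count plateaus early while the best score keeps creeping up, the search may be refining one family of solutions rather than exploring.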
05 [Inference] On-Policy Distillation with Best-of-N Teacher Rollout Selection
Standard on-policy distillation compounds errors by supervising a student on its own noisy trajectories while relying on a single stochastic teacher rollout per prompt -- a setup that frequently produces incorrect or uninformative supervision signals. Selecting the Best-of-N teacher rollout as the supervision target filters out low-quality traces before they reach the student, without requiring reinforcement learning or external data. The method outperforms both standard SFT and vanilla on-policy distillation on reasoning benchmarks. Teams improving reasoning post-training without RL infrastructure have a cleaner signal path here.
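The selection step can be sketched in a few lines. The summary does not state how the paper ranks rollouts, so `score` here is a caller-supplied, hypothetical criterion (for example, an answer checker), and `sample_teacher` stands in for whatever sampling call a real teacher model exposes.

```python
from typing import Callable, List, Tuple


def select_best_rollout(
    prompt: str,
    sample_teacher: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, List[str]]:
    """Sample n teacher rollouts and return the top-scoring one plus all samples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rollouts = [sample_teacher(prompt) for _ in range(n)]
    # max() returns the first rollout with the highest score, so ties
    # resolve to the earliest sample.
    best = max(rollouts, key=lambda rollout: score(prompt, rollout))
    return best, rollouts


def make_toy_teacher(responses: List[str]) -> Callable[[str], str]:
    """Toy stand-in for a stochastic teacher: replays canned responses in order."""
    iterator = iter(responses)
    return lambda prompt: next(iterator)


def ends_with_answer(expected: str) -> Callable[[str, str], float]:
    """Toy scorer: 1.0 if the rollout ends with the expected answer, else 0.0."""
    return lambda prompt, rollout: 1.0 if rollout.strip().endswith(expected) else 0.0
```

With `n=1` this reduces to the single-rollout baseline, which makes the extra teacher sampling cost easy to ablate against the gain in supervision quality.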
06 [Theory] Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
Math reasoning gaps in low-resource languages are not a capability ceiling -- models already hold the reasoning chains, just not in those languages. COPSD uses the same model as both student and teacher: the student sees only the low-resource problem, while the teacher receives a translation of the problem plus a high-resource reference solution as privileged context, generating supervision the student then internalizes. No external multilingual dataset is required. Teams deploying math or science assistants in low-resource language markets can close performance gaps without sourcing new training data.
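The asymmetry between the two views can be made concrete with a small prompt-building sketch. The template wording, the function name, and the choice to show the teacher the original low-resource problem alongside the translation are all assumptions for illustration; the summary does not give COPSD's actual prompt format.

```python
from typing import Dict


def build_copsd_prompts(
    problem_low_resource: str,
    problem_translation: str,
    reference_solution: str,
) -> Dict[str, str]:
    """Build the student and teacher inputs for one training example."""
    # Student view: only the problem in the low-resource language.
    student = f"Problem:\n{problem_low_resource}\n\nSolution:"
    # Teacher view: same model, but with privileged context appended.
    # (Hypothetical template; the real format is not given in the summary.)
    teacher = (
        f"Problem:\n{problem_low_resource}\n\n"
        f"Translation of the problem:\n{problem_translation}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        "Solution:"
    )
    return {"student": student, "teacher": teacher}
```

The property worth checking in any real implementation is the one this sketch enforces by construction: the reference solution never appears in the student's input, only in the teacher's.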