Also Worth Noting - 2026-05-11

Auto-discovered TTS strategies, RL-trained retrievers, 300-task continual learning, exploit benchmarking, and cold-start model selection


02 [Inference] LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Hand-crafted test-time scaling (TTS) recipes may already be obsolete. AutoTTS reframes the design problem: instead of tuning chain-of-thought budgets or beam-search heuristics by intuition, practitioners define an environment and let an agent discover compute-allocation strategies automatically. The system explores parts of the reasoning-pattern space that manual design leaves untouched, producing strategies that are model-specific rather than copied from prior work. Teams running inference-heavy pipelines should watch this closely: auto-generated TTS recipes could replace the current practice of borrowing scaling heuristics from unrelated model families.
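
To make "define an environment, then search for a compute-allocation strategy" concrete, here is a minimal sketch. It is not AutoTTS itself: brute-force enumeration stands in for the agent, and `Strategy`, `discover`, `toy_accuracy`, and every number in it are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable


@dataclass(frozen=True)
class Strategy:
    """Hypothetical compute-allocation strategy: how many chains, how long each."""
    num_samples: int
    tokens_per_sample: int

    @property
    def cost(self) -> int:
        # Total tokens spent at inference time.
        return self.num_samples * self.tokens_per_sample


def discover(
    candidates: Iterable[Strategy],
    evaluate: Callable[[Strategy], float],
    budget: int,
) -> Strategy:
    """Return the best-scoring strategy whose cost fits the budget.

    Exhaustive search is an assumption here; it stands in for the agent.
    """
    feasible = [s for s in candidates if s.cost <= budget]
    if not feasible:
        raise ValueError("no strategy fits the compute budget")
    return max(feasible, key=evaluate)


def toy_accuracy(s: Strategy) -> float:
    """Made-up environment reward, not a measured result."""
    coverage = 1.0 - 0.6 ** s.num_samples          # more chains, more chances
    depth = min(1.0, s.tokens_per_sample / 512)    # short chains get cut off
    return coverage * depth


candidates = [Strategy(n, t) for n, t in product([1, 2, 4, 8], [128, 256, 512, 1024])]
best = discover(candidates, toy_accuracy, budget=2048)
print(best)
```

Under this toy reward and a 2048-token budget, the search settles on four chains of 512 tokens each. The point is only that the trade-off between breadth and depth falls out of the environment definition instead of being set by hand.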

03 [RAG] Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
Single-step retrieval breaks on questions that require chaining multiple lookups, and fine-tuning a small LLM to orchestrate those steps is resource-intensive and locks out larger models. Q-RAG sidesteps both problems by framing retrieval as a value-function estimation task: the embedder itself learns when a retrieved chunk is sufficient and when search should continue. That moves the multi-step logic into the retriever and out of external orchestration code. Teams building complex QA pipelines can drop the prompt-chaining scaffolding and let the retriever handle continuation decisions directly.
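
As a rough illustration of value-driven continuation, the sketch below runs a greedy retrieval loop that keeps fetching chunks while a value estimate says another step is worthwhile. It is not the Q-RAG training method: `toy_value` is a hand-written word-overlap score standing in for a learned value-based embedder, and the function names, threshold, and corpus are all assumptions.

```python
from typing import Callable, List, Sequence


def multi_step_retrieve(
    query: str,
    chunks: Sequence[str],
    value_fn: Callable[[str, List[str], str], float],
    stop_threshold: float = 0.3,
    max_steps: int = 4,
) -> List[str]:
    """Greedy multi-step retrieval driven by a value estimate.

    value_fn(query, selected, candidate) scores how useful `candidate` is
    given what has already been retrieved (hypothetical interface).
    """
    selected: List[str] = []
    for _ in range(max_steps):
        remaining = [c for c in chunks if c not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda c: value_fn(query, selected, c))
        if value_fn(query, selected, best) < stop_threshold:
            break  # no remaining chunk is worth another retrieval step
        selected.append(best)
    return selected


def toy_value(query: str, selected: List[str], candidate: str) -> float:
    """Toy stand-in: share of candidate words already present in the context."""
    context = set(query.lower().split())
    for chunk in selected:
        context |= set(chunk.lower().split())
    words = set(candidate.lower().split())
    return len(words & context) / len(words)


corpus = [
    "acme was founded by jane",
    "bananas are yellow",
    "the widget is made by acme",
]
hops = multi_step_retrieve("who founded the maker of the widget", corpus, toy_value)
print(hops)
```

The second hop only becomes attractive after the first chunk introduces "acme" into the context, which is the chaining behaviour single-step retrieval misses. The unrelated chunk never clears the threshold, so the loop stops on its own.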

04 [Training] Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts
Most continual learning benchmarks test fewer than 20 tasks, a regime that hides the catastrophic forgetting and capacity collapse that appear at scale. CaRE scales past 300 class-incremental tasks with a bi-level routing mechanism: a first-stage router activates the relevant task-specific experts, while a second stage refines feature selection to preserve both stability and plasticity across the full sequence. This is the first result that makes continual learning plausible for production systems where models accumulate new classes over long operational lifetimes rather than in controlled lab splits.
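
The two routing stages can be pictured with a small sketch. This is a generic two-stage mixture-of-experts toy, not CaRE's architecture: dot-product scoring, top-k selection, softmax mixing, and the names `route_experts` and `mix_features` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def route_experts(x: Vector, expert_keys: Sequence[Vector], k: int = 2) -> List[int]:
    """Stage 1 (assumed form): activate the k experts whose keys best match x."""
    scores = [dot(x, key) for key in expert_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def mix_features(x: Vector, expert_outputs: Sequence[Vector], chosen: List[int]) -> List[float]:
    """Stage 2 (assumed form): softmax-weight the chosen experts' features."""
    logits = [dot(x, expert_outputs[i]) for i in chosen]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    dim = len(expert_outputs[chosen[0]])
    return [
        sum(w / total * expert_outputs[i][d] for w, i in zip(weights, chosen))
        for d in range(dim)
    ]


x = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]      # one key per expert
outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # each expert's feature vector
chosen = route_experts(x, keys, k=2)
mixed = mix_features(x, outputs, chosen)
print(chosen, mixed)
```

Only the selected experts contribute to the output, so adding experts for new tasks leaves inputs routed elsewhere untouched. That is the intuition behind using routing to limit forgetting.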

05 [Eval] ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?
There is a hard gap between detecting a vulnerability and producing working exploit code, and current agents fail at the latter far more often than headlines suggest. ExploitGym formalizes that boundary as a benchmark, requiring agents to move from a known CVE to concrete security impact such as unauthorized file access or code execution, a task that demands low-level memory reasoning and sustained multi-step progress. The results give security teams a calibrated threat model: AI exploitation capability is real but narrower than vague capability warnings imply, which matters for prioritizing defensive investment.

06 [Open-source] ModelLens: Finding the Best for Your Task from Myriads of Models
With over 800,000 models on Hugging Face, picking the right checkpoint for a new dataset is mostly guesswork when neither the model nor the dataset has prior benchmark records. ModelLens targets the cold-start case directly: its transferability estimation method requires no expensive per-model forward passes on the target dataset and does not assume a predefined candidate pool. That covers the scenario most real deployments actually face: a novel dataset arriving alongside a flood of new checkpoints, none of them benchmarked against each other. Teams doing model selection in production should treat this as a practical replacement for manual shortlisting.