Also Worth Noting - 2026-05-19

Five papers tightening the token-efficiency loop across reasoning, training, agents, and optimization

02 [Inference] Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
Waiting for the answer to stabilize is the wrong exit signal for chain-of-thought generation. Semantic convergence in the reasoning trace itself gives a cleaner stopping point: once successive reasoning steps stop changing in meaning, the model has finished exploring, not just finished guessing. Triggering early exit on that semantic signal matches full-chain accuracy on math benchmarks while cutting token output, and it avoids the premature exits that confidence thresholds produce. Teams paying per token for o1-style inference get a more principled kill switch.
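To make the exit rule concrete, here is a minimal sketch of convergence-based stopping. It is not the paper's implementation: the bag-of-words `embed`, the cosine comparison of adjacent steps, and the `threshold` and `patience` parameters are all assumptions chosen for illustration. A real system would presumably use a learned sentence encoder.

```python
import math
from collections import Counter
from typing import List, Optional


def embed(step: str) -> Counter:
    """Toy bag-of-words stand-in for a sentence encoder (assumption)."""
    return Counter(step.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def exit_index(steps: List[str], threshold: float = 0.8, patience: int = 2) -> Optional[int]:
    """Return the index of the step at which to stop, or None to run to the end.

    Stops once `patience` consecutive steps are each at least `threshold`
    similar to the step immediately before them.
    """
    stable = 0
    for i in range(1, len(steps)):
        if cosine(embed(steps[i - 1]), embed(steps[i])) >= threshold:
            stable += 1
            if stable >= patience:
                return i
        else:
            stable = 0  # meaning changed again, so restart the count
    return None
```

The `patience` counter is the design choice that matters here: a single near-duplicate step can be a coincidence, so the sketch requires the trace to stay semantically flat for several steps before cutting generation.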

03 [Training] CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
Flat outcome rewards in RLVR treat every token in a correct rollout identically, wasting gradient signal on filler. CEPO conditions on the correct answer as a teacher signal to identify which tokens the model would have generated differently had it known the answer, isolating the decisive reasoning steps from grammatical padding. That contrastive token-level credit assignment gives the policy a sharper training target than binary pass/fail rewards can provide. Teams running RLVR fine-tuning on reasoning models should watch whether this token-level signal translates into faster convergence.
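One way to picture token-level credit is as a reweighting of a single rollout reward. The sketch below is an assumption-laden illustration, not CEPO's actual objective: it measures each token's log-probability gap between the plain policy and the same policy conditioned on the answer, then spreads the reward in proportion to that gap. The function names and the normalization are hypothetical.

```python
from typing import List


def token_credit(logp_plain: List[float], logp_with_answer: List[float]) -> List[float]:
    """Per-token weights summing to 1, larger where answer-conditioning shifts the log-prob most."""
    if len(logp_plain) != len(logp_with_answer):
        raise ValueError("log-prob sequences must align token by token")
    gaps = [abs(a - p) for p, a in zip(logp_plain, logp_with_answer)]
    if not gaps:
        return []
    total = sum(gaps)
    if total == 0:
        # No contrast anywhere: fall back to flat credit.
        return [1.0 / len(gaps)] * len(gaps)
    return [g / total for g in gaps]


def weighted_advantages(reward: float, weights: List[float]) -> List[float]:
    """Spread one sequence-level reward over tokens in proportion to their weights.

    Scaling by the token count means flat weights reproduce the usual
    same-reward-for-every-token baseline.
    """
    n = len(weights)
    return [reward * w * n for w in weights]
```

With log-probs of `[-0.1, -2.0, -0.1]` without the answer and `[-0.1, -0.5, -0.1]` with it, all of the credit lands on the middle token, which is the behavior the blurb describes: filler tokens that the answer does not change receive no gradient weight.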

04 [Agent] PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Re-reading the same codebase or document corpus on every agent invocation is the dominant token sink in recurring long-context workloads, and no existing approach caches the structural knowledge that makes re-reading useful. PEEK builds a compact orientation map encoding what a context contains, how it is organized, and which entities have historically mattered, then reuses that map across invocations. Token consumption on repeated same-context tasks drops without degrading task performance. Directly applicable to any agent that repeatedly queries the same repository or knowledge base.
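The three ingredients of the map (contents, organization, historically relevant entities) suggest a small cached data structure. The following is a rough sketch under stated assumptions, not PEEK's design: `OrientationMap`, `MapCache`, and the heading-based `build_from_headings` builder are hypothetical names, and keying the cache on a content hash is one plausible choice among several.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OrientationMap:
    summary: str                    # what the context contains
    sections: Dict[str, str]        # how it is organized: name -> one-line description
    entity_hits: Counter = field(default_factory=Counter)  # which entities mattered before

    def record(self, entities: List[str]) -> None:
        """Note the entities a finished task actually relied on."""
        self.entity_hits.update(entities)

    def render(self, top_k: int = 3) -> str:
        """Compact text to place in the prompt instead of the full context."""
        lines = [f"Summary: {self.summary}"]
        lines += [f"- {name}: {desc}" for name, desc in self.sections.items()]
        top = [e for e, _ in self.entity_hits.most_common(top_k)]
        if top:
            lines.append("Frequently relevant: " + ", ".join(top))
        return "\n".join(lines)


class MapCache:
    """Builds a map once per distinct context and reuses it afterwards."""

    def __init__(self, build: Callable[[str], OrientationMap]) -> None:
        self._build = build
        self._maps: Dict[str, OrientationMap] = {}
        self.builds = 0

    def get(self, context: str) -> OrientationMap:
        key = hashlib.sha256(context.encode("utf-8")).hexdigest()
        if key not in self._maps:
            self._maps[key] = self._build(context)
            self.builds += 1
        return self._maps[key]


def build_from_headings(context: str) -> OrientationMap:
    """Toy builder: '# ' lines become sections, described by their first body line."""
    sections: Dict[str, str] = {}
    current = None
    for line in context.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = ""
        elif current and line.strip() and not sections[current]:
            sections[current] = line.strip()
    return OrientationMap(summary=f"{len(sections)} sections", sections=sections)
```

A second `get` on an unchanged context returns the same object without rebuilding, and anything recorded via `record` after earlier tasks shows up in later `render` output. Hashing the content means any edit to the context forces a rebuild, which is conservative; an incremental update scheme would save more but is beyond this sketch.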

05 [Inference] Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
In long-context, multi-turn agentic pipelines, prefill rather than decoding dominates wall-clock time, so uniform quantization across both phases misallocates the precision budget. Mix-Quant applies aggressive FP4 quantization only during prefill and keeps decoding at full precision, targeting the actual compute bottleneck rather than the cheaper phase. The result cuts prefill latency on agentic workflows while avoiding the accuracy degradation that full-pipeline quantization incurs. A straightforward phase-aware swap for teams already running quantized serving infrastructure.
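The phase split can be illustrated with simulated ("fake") quantization. This sketch makes several assumptions that the summary above does not confirm: it quantizes weights only, uses a single per-tensor scale, and selects precision with a hypothetical `phase` argument. The grid holds the magnitudes representable in the E2M1 flavor of FP4; production kernels typically use per-block scales and real 4-bit storage.

```python
import math
from typing import List

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_fp4(values: List[float]) -> List[float]:
    """Round values to a scaled FP4 grid and return them as ordinary floats."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return list(values)
    scale = peak / FP4_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for v in values:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return out


class PhaseAwareLinear:
    """Single-output linear layer that picks weight precision by serving phase."""

    def __init__(self, weights: List[float]) -> None:
        self.full = list(weights)
        self.low = fake_fp4(weights)  # quantize once, reuse for every prefill call

    def forward(self, x: List[float], phase: str) -> float:
        if phase == "prefill":
            w = self.low    # cheap, lossy path for the long prompt
        elif phase == "decode":
            w = self.full   # precise path for generated tokens
        else:
            raise ValueError(f"unknown phase: {phase!r}")
        return sum(a * b for a, b in zip(x, w))
```

With weights `[6.0, 2.4, -0.7, 0.0]`, the prefill path sees `[6.0, 2.0, -0.5, 0.0]`, so rounding error is confined to the phase where the summary says the compute time actually goes.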

06 [Application] optimize_anything: A Universal API for Optimizing any Text Parameter
A single LLM-based optimizer almost triples Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%) and cuts cloud scheduling costs by 40%, without any domain-specific pipeline. The system frames any optimization problem as improving a text artifact evaluated by a scoring function, then supports single-task search, cross-problem transfer, and generalization to unseen inputs under one unified interface. Those results span prompt engineering, code optimization, and config tuning. Teams maintaining separate specialized optimizers for each domain should weigh that overhead against this generalist baseline.
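The "text artifact plus scoring function" framing reduces to a small search loop. The sketch below is not the optimize_anything API: `optimize_text`, its parameters, and the toy config problem are hypothetical, and the `propose` callback is a deterministic stand-in for what would presumably be an LLM call. It covers only the single-task search mode.

```python
from typing import Callable, Iterable, List, Tuple


def optimize_text(
    seed: str,
    score: Callable[[str], float],
    propose: Callable[[str], Iterable[str]],
    rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy search over text artifacts: keep the best-scoring candidate seen so far."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best):
            s = score(candidate)
            if s > best_score:
                best, best_score, improved = candidate, s, True
        if not improved:
            break  # no proposal beat the incumbent, so stop early
    return best, best_score


# Toy problem (hypothetical): tune a one-line config toward a target batch size.
def parse_batch(cfg: str) -> int:
    return int(cfg.split("=")[1])


def score_config(cfg: str) -> float:
    return -abs(parse_batch(cfg) - 32)


def propose_configs(cfg: str) -> List[str]:
    n = parse_batch(cfg)
    return [f"batch={n * 2}", f"batch={max(1, n // 2)}"]
```

Calling `optimize_text("batch=1", score_config, propose_configs)` walks 1, 2, 4, 8, 16, 32 and stops when neither neighbor scores higher. The optimizer never inspects the artifact's format; only `score` and `propose` know it is a config line, which is the property that lets one loop serve prompts, code, and configs alike.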