Also Worth Noting - 2026-05-20

Five papers on training efficiency, reward modeling, and unified optimization, ranging from RLVR geometry to text-to-image training at roughly 80% lower compute

Also Worth Noting

02 [Training] You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
RLVR weight updates compress almost entirely into a single rank-1 direction, and that direction grows near-linearly with training steps. That geometry means the downstream reasoning gains from a full training run can be predicted and replicated by extrapolating a handful of early checkpoints rather than running to completion. Teams spending compute on long RLVR fine-tuning runs should check whether a rank-1 extrapolation from their first few checkpoints already closes most of the gap. link
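To make the extrapolation idea concrete, here is a minimal sketch for a single weight matrix. It is not the paper's procedure: the function name `rank1_extrapolate`, the choice to take the direction from the latest checkpoint's top singular pair, and the straight-line fit over step counts are all assumptions made for illustration.

```python
import numpy as np

def rank1_extrapolate(w0, checkpoints, steps, target_step):
    """Extrapolate one weight matrix to target_step along a rank-1 direction.

    Hypothetical helper: w0 is the base weight matrix, checkpoints are the
    same matrix saved at the given early training steps.
    """
    # Weight updates relative to the base model.
    deltas = [w - w0 for w in checkpoints]
    # Dominant rank-1 direction, taken from the latest observed update.
    u, _, vt = np.linalg.svd(deltas[-1], full_matrices=False)
    direction = np.outer(u[:, 0], vt[0])  # unit Frobenius norm
    # Scalar coefficient of each update along that direction.
    coeffs = np.array([np.sum(d * direction) for d in deltas])
    # Least-squares line: coefficient ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)
    # Base weights plus the extrapolated rank-1 update.
    return w0 + (slope * target_step + intercept) * direction
```

In practice the check the summary suggests would compare a model rebuilt this way against a fully trained one on the downstream benchmark; the sketch only covers the weight arithmetic.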

03 [Eval] Process Rewards with Learned Reliability
Standard process reward models hand downstream search a single score per reasoning step with no signal about whether that score should be trusted. BetaPRM replaces the point estimate with a Beta distribution fitted over Monte Carlo continuation outcomes, encoding both a success probability and the reliability of that estimate. Beam search and MCTS can then down-weight steps where the model is uncertain rather than treating all scores as equally authoritative. Teams using PRMs for test-time compute scaling should find this a direct drop-in improvement over single-score baselines. link
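A small sketch shows why a Beta distribution carries more information than a point score. It assumes a uniform Beta(1, 1) prior updated with rollout counts and a mean-minus-standard-deviation penalty for search; BetaPRM itself may fit and use the distribution differently, and every name here (`BetaScore`, `from_rollouts`, `conservative_score`) is hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaScore:
    """Step score as a Beta(alpha, beta) distribution over success probability."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected probability that continuations from this step succeed.
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        # Spread of the estimate; shrinks as more rollouts are observed.
        total = self.alpha + self.beta
        return math.sqrt(self.alpha * self.beta / (total ** 2 * (total + 1)))

def from_rollouts(successes: int, rollouts: int, prior: float = 1.0) -> BetaScore:
    """Assumed fitting rule: a symmetric Beta prior updated with rollout counts."""
    return BetaScore(prior + successes, prior + rollouts - successes)

def conservative_score(score: BetaScore, penalty: float = 1.0) -> float:
    """Assumed search heuristic: penalize steps whose estimate is unreliable."""
    return score.mean - penalty * score.std

few = from_rollouts(2, 4)     # 2 of 4 continuations succeeded
many = from_rollouts(20, 40)  # same success rate, ten times the evidence
```

Both `few` and `many` have the same mean, but `many` has a smaller standard deviation, so a search procedure using `conservative_score` would rank it higher.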

04 [Agent] PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
Agents working repeatedly over the same document corpus or code repository re-read and re-orient to that context on every invocation, paying the same processing cost each time. PEEK separates orientation knowledge (what the corpus contains, how it is organized, which entities and schemas matter) from the task-specific trajectory, and caches that knowledge as a map reused across invocations. For code-repo and document agents with recurring same-context workloads, this translates directly to lower latency and reduced token spend without sacrificing task performance. link
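The caching pattern can be sketched in a few lines. This illustrates the general idea only: the summary does not describe PEEK's map format or invalidation rule, so the content-hash key, the `OrientationCache` class, and the `toy_map` builder are all assumptions.

```python
import hashlib

class OrientationCache:
    """Reuse an orientation map while the corpus content is unchanged."""

    def __init__(self, build_map):
        self._build_map = build_map  # stand-in for an expensive orientation pass
        self._maps = {}
        self.builds = 0              # how many times the map was actually built

    @staticmethod
    def fingerprint(corpus: dict) -> str:
        # Hash paths and contents in sorted order so the key is deterministic.
        digest = hashlib.sha256()
        for path in sorted(corpus):
            digest.update(path.encode())
            digest.update(b"\0")
            digest.update(corpus[path].encode())
            digest.update(b"\0")
        return digest.hexdigest()

    def get(self, corpus: dict):
        key = self.fingerprint(corpus)
        if key not in self._maps:
            # Cache miss: pay the orientation cost once for this corpus state.
            self._maps[key] = self._build_map(corpus)
            self.builds += 1
        return self._maps[key]

def toy_map(corpus: dict) -> dict:
    """Hypothetical map builder: record the first line of every file."""
    return {path: (text.splitlines()[0] if text else "")
            for path, text in corpus.items()}
```

Each agent invocation would call `get` with the current corpus and spend tokens on orientation only when the fingerprint changes.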

05 [Training] Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
A 3.8B-parameter text-to-image model matches or beats 6B+ competitors while using only 19.3% of the training compute required by Z-Image. The efficiency comes from two choices: training on 800M densely recaptioned image-text pairs to maximize information density per batch, and a compact architecture that avoids over-parameterization. That a smaller model holds its own against larger ones on standard benchmarks suggests current large T2I models carry significant wasted capacity, and that data quality engineering can outweigh raw parameter count. link

06 [Application] optimize_anything: A Universal API for Optimizing any Text Parameter
A single LLM-based optimization loop, framed as iterative improvement of a text artifact scored by any callable function, matches specialized tools across six distinct domains without per-task engineering. The same system finds agent architectures that lift Gemini Flash's ARC-AGI accuracy from 32.5% to 89.5% and scheduling algorithms that cut cloud costs by 40%. That breadth challenges the assumption that prompt optimization, hyperparameter search, and code tuning each require purpose-built tooling. link
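The loop described above can be sketched generically. This is not the optimize_anything API: `optimize_text`, its parameters, and the random-mutation proposer standing in for the LLM are hypothetical, and the real system presumably proposes edits far more intelligently.

```python
import random
from typing import Callable, Tuple

def optimize_text(
    seed: str,
    score: Callable[[str], float],
    propose: Callable[[str, float], str],
    iterations: int = 50,
) -> Tuple[str, float]:
    """Greedy improvement of a text artifact under an arbitrary scoring callable."""
    best, best_score = seed, score(seed)
    for _ in range(iterations):
        # In an LLM-based system, this call would ask the model for a revision.
        candidate = propose(best, best_score)
        candidate_score = score(candidate)
        # Keep the candidate only if the scorer rates it strictly higher.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

def make_mutator(rng: random.Random, alphabet: str = "abcdefghijklmnopqrstuvwxyz"):
    """Toy proposer: replace one random character (stand-in for an LLM edit)."""
    def propose(text: str, _score: float) -> str:
        i = rng.randrange(len(text))
        return text[:i] + rng.choice(alphabet) + text[i + 1:]
    return propose
```

Because the scorer is just a callable, the same loop applies whether the text is a prompt, a config file, or source code; only `score` and `propose` change.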