Also Worth Noting - 2026-05-28

Five papers on training dynamics, capacity limits, belief tracking, tool-use gaps, and video token compression

02 [Training] Self-Improving Language Models with Bidirectional Evolutionary Search
Best-of-N and tree search both get stuck because they only expand candidates forward from high-probability regions, leaving most of the solution space untouched. Bidirectional Evolutionary Search (BES) mutates candidates in both forward and backward directions, coupling autoregressive expansion with backward revision to escape local optima that sparse verification signals can't distinguish. The result is a self-improvement loop that isn't bottlenecked by the model's existing probability mass. Teams running post-training pipelines where best-of-N plateaus early should treat this as a direct alternative.

03 [Theory] How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
LoRA rank and layer depth jointly cap how many facts a model can reliably store, and that ceiling follows a quantifiable power law. The Parametric Memory Law gives practitioners a formula to predict storage capacity before training starts, rather than discovering the limit after a fine-tuned model fails to retain new knowledge in production. Sizing adapters has been mostly empirical trial-and-error; this replaces that with a principled calculation made before fine-tuning begins. Teams doing continual knowledge updates via LoRA should run the capacity estimate before committing to a rank configuration.
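
The paper's actual formula is not reproduced here, so the following is only a minimal sketch of the general idea: fit a power law to a few pilot measurements and extrapolate before committing to a rank. The single-variable form `capacity = a * rank**b` (depth held fixed), the function names, and the synthetic pilot numbers are all assumptions for illustration, not values from the paper.

```python
import math

def fit_power_law(ranks, capacities):
    """Least-squares fit of capacity = a * rank**b in log-log space.

    Assumed form for illustration only; the paper's law also depends on
    layer depth, which is held fixed here.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(c) for c in capacities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # exponent (slope in log-log space)
    a = math.exp(mean_y - b * mean_x)    # prefactor (exp of the intercept)
    return a, b

def predict_capacity(a, b, rank):
    """Extrapolate the fitted law to a rank that was not measured."""
    return a * rank ** b

# Synthetic pilot data generated from a known law, NOT results from the paper.
pilot_ranks = [4, 8, 16, 32]
pilot_capacities = [100.0 * r ** 0.5 for r in pilot_ranks]

a, b = fit_power_law(pilot_ranks, pilot_capacities)
estimate_at_64 = predict_capacity(a, b, 64)
```

In practice the pilot capacities would come from small fine-tuning runs that count how many injected facts the adapter recalls; the fit is only as trustworthy as the range of ranks it was measured over.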

04 [Eval] When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Standard long-context benchmarks reward final-answer accuracy and miss a more damaging failure: models frequently update their beliefs on task-irrelevant noise while ignoring formal evidence that should trigger a genuine update. BeliefTrack isolates this by introducing a closed-world benchmark across Rule Discovery and Circuit Diagnosis tasks, where a finite belief space and symbolic verifiers enable exact turn-level evaluation rather than end-state scoring. Three distinct failure modes emerge: updating on noise, failing to update on evidence, and failing to hold a stable state. Any team evaluating long-horizon agents on existing benchmarks is likely missing at least one of these failure modes entirely.
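
To make the difference between turn-level and end-state scoring concrete, here is a small sketch. It is not BeliefTrack's implementation: the `Turn` record, the three turn kinds, and the label names are hypothetical, and the mapping of labels onto the three failure modes is an assumption.

```python
from typing import List, NamedTuple

class Turn(NamedTuple):
    kind: str      # "evidence", "noise", or "none" (hypothetical turn kinds)
    before: str    # model's belief entering the turn
    after: str     # model's belief leaving the turn
    expected: str  # belief a symbolic verifier says is correct after the turn

def label_turn(turn: Turn) -> str:
    """Assign one turn to "ok" or to one of three failure labels."""
    changed = turn.after != turn.before
    if turn.kind == "evidence":
        # Assumption: landing on the wrong belief also counts as a failed update.
        return "ok" if turn.after == turn.expected else "failed_to_update"
    if turn.kind == "noise":
        return "updated_on_noise" if changed else "ok"
    # No new input at all: any change is drift.
    return "unstable" if changed else "ok"

def end_state_correct(trace: List[Turn]) -> bool:
    """End-state scoring only looks at the final belief."""
    return trace[-1].after == trace[-1].expected

# Toy trace: the model ends on the right belief after two bad turns.
trace = [
    Turn("evidence", "rule_A", "rule_A", "rule_B"),  # ignored real evidence
    Turn("noise", "rule_A", "rule_C", "rule_B"),     # moved on irrelevant input
    Turn("evidence", "rule_C", "rule_B", "rule_B"),  # finally correct
]
labels = [label_turn(t) for t in trace]
final_ok = end_state_correct(trace)
```

The toy trace passes end-state scoring while two of its three turns are failures, which is the gap the paragraph above describes.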

05 [Agent] Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Tool calls are structurally penalized during standard RL training because they introduce high variance relative to self-contained reasoning steps, so models learn to avoid them even when a tool would return the correct answer. Under GRPO, tool use appears in only roughly 30% of rollouts, and tool-using rollouts carry disproportionate variance that the reward signal treats as noise rather than signal. This Thinking-Acting Gap explains why vision-language models with strong chain-of-thought scores still underperform on tool-use tasks. Teams training multimodal agents with GRPO or similar recipes should audit tool-call frequency in rollouts before attributing poor performance to model capability.
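
The audit suggested above can be as simple as splitting logged rollouts by whether they called a tool and comparing rates and reward variance. This is a minimal sketch; the rollout record shape (`tool_calls`, `reward`) and the function name are assumptions, not a format defined by the paper or by any particular RL library.

```python
from statistics import mean, pvariance

def audit_tool_use(rollouts):
    """Summarize tool-call frequency and reward spread across rollouts.

    Each rollout is assumed to be a dict with an integer "tool_calls"
    count and a float "reward" (hypothetical logging format).
    """
    with_tools = [r["reward"] for r in rollouts if r["tool_calls"] > 0]
    without_tools = [r["reward"] for r in rollouts if r["tool_calls"] == 0]

    def summarize(rewards):
        if not rewards:
            return {"count": 0, "mean_reward": None, "reward_variance": None}
        return {
            "count": len(rewards),
            "mean_reward": mean(rewards),
            "reward_variance": pvariance(rewards),  # population variance
        }

    return {
        "tool_use_rate": len(with_tools) / len(rollouts) if rollouts else 0.0,
        "tool": summarize(with_tools),
        "no_tool": summarize(without_tools),
    }

# Toy rollouts for illustration only.
rollouts = [
    {"tool_calls": 2, "reward": 1.0},
    {"tool_calls": 0, "reward": 0.5},
    {"tool_calls": 0, "reward": 0.5},
    {"tool_calls": 1, "reward": 0.0},
]
report = audit_tool_use(rollouts)
```

A low `tool_use_rate` combined with a much higher `reward_variance` in the `tool` group than in the `no_tool` group would be consistent with the pattern described above, and worth checking before blaming model capability.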

06 [Inference] EarlyTom: Early Token Compression Completes Fast Video Understanding
Most video token compression methods wait until late in the prefilling stage, leaving the vision encoder itself as an unaddressed compute bottleneck. EarlyTom compresses visual tokens at the encoder stage before they reach the LLM backbone, preserving spatial structure that late-compression methods discard when they prune mid-attention. That structural preservation recovers 2-4 points on temporal reasoning benchmarks at the same token retention budget. For teams deploying video LLMs where time-to-first-token is the binding constraint, moving compression upstream is the more direct fix.