Also Worth Noting - 2026-05-08

Five papers on breaking default assumptions in generation order, expert allocation, credit assignment, reasoning ceilings, and training automation

02 [Inference] Continuous Latent Diffusion Language Model
Text generation does not require left-to-right order, and Cola DLM makes that concrete. The model separates global semantic planning from local token generation through a hierarchical structure: a Text VAE first maps text into a stable continuous latent space, then a diffusion process models the global semantic prior before any tokens are produced. This decoupling means fewer sequential steps are needed to reach quality comparable to autoregressive models. Teams exploring non-AR generation pipelines have a new architecture worth benchmarking against standard baselines.

03 [Training] UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
Replacing a deeper MoE layer's learned top-k router with uniform random routing drops accuracy by only 1.0 to 1.6 points across production models, which suggests per-layer expert isolation is largely redundant. UniPool replaces the conventional per-layer expert sets with a single shared pool across all transformer layers, cutting expert-parameter growth that otherwise scales linearly with depth. Routing probes show deeper layers already reuse expert patterns established earlier in the network. Teams designing MoE architectures can treat per-layer isolation as a default worth questioning, not a requirement.
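The depth-scaling point can be made concrete with a back-of-envelope count. The sketch below is only an illustration: the function names, the two-matrix feed-forward expert, and every dimension (model width, hidden width, experts per layer, pool size) are assumptions chosen for the arithmetic, not figures or code from the UniPool paper. Router weights are left out.

```python
def expert_params(d_model: int, d_ff: int) -> int:
    """Weights in one feed-forward expert: an up- and a down-projection.

    Assumes a plain two-matrix expert with no biases (an illustrative choice).
    """
    return 2 * d_model * d_ff


def per_layer_expert_params(n_layers: int, experts_per_layer: int,
                            d_model: int, d_ff: int) -> int:
    """Conventional MoE: every layer owns its own expert set."""
    return n_layers * experts_per_layer * expert_params(d_model, d_ff)


def shared_pool_expert_params(pool_size: int, d_model: int, d_ff: int) -> int:
    """Shared pool: one set of experts reused by every layer."""
    return pool_size * expert_params(d_model, d_ff)


if __name__ == "__main__":
    d_model, d_ff = 1024, 4096          # hypothetical widths
    experts_per_layer, pool_size = 8, 64  # hypothetical counts

    for n_layers in (12, 24, 48):
        isolated = per_layer_expert_params(n_layers, experts_per_layer,
                                           d_model, d_ff)
        shared = shared_pool_expert_params(pool_size, d_model, d_ff)
        # The isolated count doubles with depth; the shared count does not move.
        print(f"{n_layers} layers: per-layer={isolated:,} shared={shared:,}")
```

Under these assumptions the per-layer total doubles each time depth doubles, while the shared-pool total depends only on the pool size. How large the pool has to be to match quality is an empirical question this sketch does not address.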

04 [Agent] A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
Sparse trajectory-level rewards make it nearly impossible to tell which tool call in a multi-turn agent trace actually mattered. A$^2$TGPO assigns per-turn credit using adaptive turn-level clipping derived from the policy's own predicted outcome probabilities, so it needs neither a separate process reward model nor tree-based rollout restructuring. For teams training agents on tool-use tasks where only final outcomes are labeled, this is a direct path to finer-grained credit assignment without added infrastructure.

05 [Theory] Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on long-horizon reasoning tasks hits a ceiling set by model expressiveness, not by data volume or training steps. ScaleLogic, a synthetic logical reasoning framework, independently controls proof-planning depth and logic expressiveness, isolating which axis actually limits RL gains. When a model lacks the representational capacity to express the required logic, additional RL compute produces no improvement. The practical read: before scaling RL training for reasoning, verify the base model's expressiveness matches the task's logical demands.

06 [Open-source] Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
Specialist agents running a closed empirical loop found training recipe improvements that human researchers had not tried, with full auditable logs of every hypothesis, code diff, and scored experiment. The system's output is not a single checkpoint but a trajectory of proposals and failure labels, which makes the results reproducible and inspectable. Lineage feedback, where agents condition new proposals on prior measured outcomes including failures, is the mechanism that keeps the search from cycling. Teams doing automated ML research should note the auditable trajectory format as a practical model for reproducibility.
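For readers who want a feel for what a lineage-conditioned loop with an auditable trajectory might look like, here is a toy sketch. It is not the paper's system: `Trial`, `run_search`, the candidate list, and the scores are all hypothetical stand-ins, and the "agent" is a trivial function that refuses to re-propose anything already in the log.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Trial:
    proposal: str                  # what was tried
    parent: Optional[str]          # best scored proposal known at the time
    score: Optional[float]         # None when the run failed
    failure_label: Optional[str]   # kept so failures stay in the record


def best_so_far(history: List[Trial]) -> Optional[str]:
    """Return the highest-scoring successful proposal, if any."""
    scored = [t for t in history if t.score is not None]
    return max(scored, key=lambda t: t.score).proposal if scored else None


def run_search(propose: Callable[[List[Trial]], Optional[str]],
               evaluate: Callable[[str], float],
               steps: int) -> List[Trial]:
    """Closed loop: each proposal sees the full history, failures included."""
    history: List[Trial] = []
    for _ in range(steps):
        proposal = propose(history)
        if proposal is None:       # nothing new left to try
            break
        parent = best_so_far(history)
        try:
            score, label = evaluate(proposal), None
        except Exception as exc:   # record the failure instead of dropping it
            score, label = None, type(exc).__name__
        history.append(Trial(proposal, parent, score, label))
    return history


# Hypothetical candidates and outcomes, for illustration only.
CANDIDATES = ["lr=3e-4", "lr=1e-3", "warmup=0", "lr=1e-2"]
SCORES = {"lr=3e-4": 0.71, "lr=1e-3": 0.74, "lr=1e-2": 0.52}


def propose(history: List[Trial]) -> Optional[str]:
    tried = {t.proposal for t in history}   # failed trials count as tried
    return next((c for c in CANDIDATES if c not in tried), None)


def evaluate(proposal: str) -> float:
    if proposal == "warmup=0":
        raise RuntimeError("training diverged")
    return SCORES[proposal]


if __name__ == "__main__":
    for trial in run_search(propose, evaluate, steps=10):
        print(trial)
```

Because failed trials stay in the history, the proposer never revisits them, which is the anti-cycling property in miniature. The returned list is the artifact: every proposal, its parent, and its outcome, in order.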