Also Worth Noting - 2026-05-22
Five papers narrowing the gap between benchmark scores and production reality, from kernel generation to agent evals.
02 [Inference] Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
A single scalar gate controlling both erase and write is quietly scrambling memory in delta-rule linear attention models. Gated DeltaNet-2 fixes this by splitting the operation into two independent gates: one governing how much old content to remove on the key side, another controlling how much new content to write in. The separation lets the model edit its fixed-size recurrent state without corrupting existing associations, which is the failure mode that has capped prior linear attention at long context. Teams evaluating linear attention for long-context inference should treat this architectural split as a baseline requirement going forward.
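To make the erase/write split concrete, here is a minimal sketch of one recurrent step on a small key-value state. The update form (erase along the key direction, then write the new value) is an assumption based on the standard delta rule, not the paper's exact formulation, and the names `delta_update`, `read`, `erase_gate` and `write_gate` are hypothetical.

```python
def read(state, key):
    """Return the value the state currently associates with `key`."""
    return [sum(s * k for s, k in zip(row, key)) for row in state]


def delta_update(state, key, value, erase_gate, write_gate):
    """One step on a (d_v x d_k) state with separate erase and write strengths.

    Assumed form (illustrative only): S' = S - erase_gate * (S k) k^T
                                           + write_gate * v k^T
    with a unit-norm key k.
    """
    readout = read(state, key)  # what is already stored along this key
    return [
        [s - erase_gate * r * k + write_gate * v * k for s, k in zip(row, key)]
        for row, r, v in zip(state, readout, value)
    ]


# Store two associations under orthogonal keys.
state = [[0.0, 0.0], [0.0, 0.0]]
state = delta_update(state, [1.0, 0.0], [3.0, 4.0], 1.0, 1.0)
state = delta_update(state, [0.0, 1.0], [9.0, 2.0], 1.0, 1.0)

# Coupled gate (one scalar for both): the old and new values get blended.
coupled = delta_update(state, [1.0, 0.0], [7.0, 8.0], 0.5, 0.5)
print(read(coupled, [1.0, 0.0]))    # [5.0, 6.0]

# Decoupled gates: erase fully, write at half strength.
decoupled = delta_update(state, [1.0, 0.0], [7.0, 8.0], 1.0, 0.5)
print(read(decoupled, [1.0, 0.0]))  # [3.5, 4.0]
print(read(decoupled, [0.0, 1.0]))  # [9.0, 2.0], the other association is untouched
```

With a single shared scalar, weakening the write also weakens the erase, so stale content lingers. Two gates let the step clear the old value completely while still writing cautiously.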
03 [Theory] LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Standard power-law scaling laws predict monotonic improvement with compute, yet catastrophic overtraining and quantization-induced degradation both show performance falling as compute rises. The Shannon Scaling Law reframes LLM training as information transmission over a noisy channel, mapping model parameters to channel bandwidth and training tokens to signal power, which makes a performance ceiling at high compute a theoretically expected outcome rather than an empirical anomaly. Overtraining is not a tuning mistake; it is what happens when signal power saturates channel capacity. Teams planning long training runs should factor this ceiling into their compute budgets before hitting it.
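For reference, the classical Shannon-Hartley capacity formula that this framing draws on is shown below. Reading B as parameter count and S as training tokens follows the summary above. The noise term N and the paper's exact functional form are not given here, so treat this as the textbook formula, not the paper's scaling law.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

For fixed B and N, capacity grows only logarithmically in S, so each additional unit of signal power buys less capacity than the last.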
04 [Eval] FastKernels: Benchmarking GPU Kernel Generation in Production
Every existing GPU kernel generation benchmark is measuring something that does not ship. Current evals run on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward agents for replicating known optimizations, producing kernels that score well in sandboxes but introduce interface incompatibilities and silent correctness failures in production inference stacks. FastKernels realigns evaluation against real production frameworks, exposing that leaderboard rankings reflect sandbox performance, not deployment readiness. Teams using LLM-based kernel generation agents should validate against their actual compilation stack before trusting any published score.
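As a rough illustration of why synthetic inputs can hide failures, the sketch below checks a stand-in "generated" routine against a reference on two kinds of input. Everything here is hypothetical and CPU-only: `candidate_softmax` plays the role of a generated kernel, and `validate` is an illustrative harness, not part of FastKernels.

```python
import math
import random


def reference_softmax(xs):
    """Numerically stable softmax, used as the trusted reference."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_softmax(xs):
    """Stand-in for a generated kernel: omits the max-subtraction step."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def validate(candidate, reference, cases, tol=1e-6):
    """Return (case_name, reason) for every case where the candidate fails."""
    failures = []
    for name, xs in cases:
        try:
            got = candidate(xs)
        except (OverflowError, ZeroDivisionError) as exc:
            failures.append((name, type(exc).__name__))
            continue
        want = reference(xs)
        if len(got) != len(want) or any(abs(g - w) > tol for g, w in zip(got, want)):
            failures.append((name, "mismatch"))
    return failures


rng = random.Random(0)
cases = [
    # Sandbox-style input: small, well-behaved values.
    ("synthetic_small", [rng.uniform(-1.0, 1.0) for _ in range(8)]),
    # Production-style input: large logits, as can occur in real models.
    ("large_logits", [1000.0, 999.0, 998.0]),
]
print(validate(candidate_softmax, reference_softmax, cases))
# [('large_logits', 'OverflowError')]
```

The candidate passes the synthetic case and fails only on the realistic one, which is the pattern the benchmark is designed to surface. A real check would run inside the team's own compilation and serving stack.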
05 [Agent] TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Hand-crafted shell benchmarks systematically omit the long-tail workflows that define real terminal use. TerminalWorld reverse-engineers 80,870 in-the-wild terminal recordings into 1,530 validated tasks spanning 18 categories and 1,280 unique commands, including multi-step sequences exceeding 50 steps, then curates a Verified subset of 200 manually reviewed tasks. Because tasks are derived from actual human workflows rather than designed from scratch, coverage of rare command combinations and extended pipelines comes for free. Agent researchers benchmarking on synthetic shell evals should treat TerminalWorld-Verified as a calibration check on whether their results hold against real usage patterns.
06 [Application] Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
Prompt-engineered agents handle simple spreadsheet operations but fall apart on the multi-step formula and data-manipulation workflows that define real enterprise use. Spreadsheet-RL applies reinforcement learning directly in live Excel and Google Sheets environments, letting agents learn from task-completion signals across realistic workflows rather than from curated demonstrations. RL-trained agents outperform prompt-engineered baselines on complex, multi-step tasks, which is the first credible evidence that RL adds measurable value specifically for structured office automation. Teams building enterprise automation agents should consider RL fine-tuning as a viable path for spreadsheet workflows rather than relying on prompting alone.