Also Worth Noting - 2026-06-05

Agent safety blind spots, fragile arithmetic geometry, and a unified fix for broken genomic benchmarks


02 [Agent] The Cold-Start Safety Gap in LLM Agents
Tool-calling agents are most vulnerable to unsafe behavior at the very start of a session, before any task context has accumulated. The SODA benchmark controls how many regular agentic tasks precede a safety threat; across 7 models from 4 families, safety improves by 9% to 52% as the prior task count rises. Most current safety evals miss that cold-start window because they test at arbitrary conversation depth. Teams deploying agents in production should treat first-turn interactions as a distinct, higher-risk surface.
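To make the idea of controlling conversation depth concrete, here is a minimal sketch of a harness that places a fixed number of benign tasks before a threat and reports a safe-response rate per depth. The function names, the session format, and the outcome tuples are all illustrative assumptions; this is not SODA's actual interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_session(benign_tasks: List[str], threat: str, n_prior: int) -> List[str]:
    """Return a session with exactly n_prior benign tasks before the threat."""
    if n_prior > len(benign_tasks):
        raise ValueError("not enough benign tasks for the requested depth")
    return benign_tasks[:n_prior] + [threat]


def safe_rate_by_depth(outcomes: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Aggregate (n_prior, was_safe) pairs into a safe-response rate per depth."""
    totals: Dict[int, int] = defaultdict(int)
    safe: Dict[int, int] = defaultdict(int)
    for n_prior, was_safe in outcomes:
        totals[n_prior] += 1
        safe[n_prior] += int(was_safe)
    # Sorted keys make the cold-start case (n_prior == 0) the first entry.
    return {n: safe[n] / totals[n] for n in sorted(totals)}
```

Reporting the rate at `n_prior == 0` separately from deeper sessions is what exposes a cold-start gap; an eval that pools all depths would average it away.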

03 [Eval] SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
Coding agents can cause serious workspace damage through action sequences where each individual step looks benign. SABER evaluates safety from the final environment state after a full action sequence, not from whether the model refused a prompt, and categorizes violations by cause. Refusal-rate metrics miss this class of harm entirely. Any team using prompt-refusal benchmarks to certify coding agent safety is measuring the wrong thing.
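A rough sketch of final-state evaluation, assuming a workspace modeled as a path-to-contents dictionary and a hypothetical list of protected paths. None of these names come from SABER; the point is only that the check runs on the end state, not on any single step.

```python
from typing import Callable, Dict, List

Workspace = Dict[str, str]          # path -> file contents (toy model)
Action = Callable[[Workspace], None]


def move(src: str, dst: str) -> Action:
    """Action that moves a file; harmless when viewed in isolation."""
    def act(ws: Workspace) -> None:
        ws[dst] = ws.pop(src)
    return act


def delete_prefix(prefix: str) -> Action:
    """Action that deletes everything under a directory prefix."""
    def act(ws: Workspace) -> None:
        for path in [p for p in ws if p.startswith(prefix)]:
            del ws[path]
    return act


def final_state_violations(
    initial: Workspace, actions: List[Action], protected: List[str]
) -> List[str]:
    """Replay all actions on a copy, then list protected paths that were lost or changed."""
    ws = dict(initial)
    for action in actions:
        action(ws)
    return [p for p in protected if ws.get(p) != initial.get(p)]
```

Moving `data/prod.db` into `tmp/` and then cleaning `tmp/` are each routine operations, yet together they destroy a protected file. Only a check on the final workspace catches that.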

04 [Theory] The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models
LLMs represent multi-operand addition through continuous carry fibers anchored to semantic digit positions, not discrete symbolic steps. The Iso-Raw-Sum Trajectory structure in the residual stream shows that arithmetic errors are geometric slippages caused by internal neural noise pushing a latent carry potential across quantization boundaries. This geometry makes failures that look random structurally predictable. Prompt perturbations that shift digit-position anchors are the most likely culprits when arithmetic breaks unexpectedly.
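The boundary-crossing intuition can be illustrated with a deliberately simple toy, assuming the latent carry is a column's raw sum divided by 10 plus additive noise, and that the carry is read out by flooring. This is a stand-in for intuition only and is not the paper's model of the residual stream.

```python
import math


def true_carry(raw_sum: int) -> int:
    """Exact carry out of a decimal column with the given raw sum."""
    return raw_sum // 10


def decoded_carry(raw_sum: int, noise: float) -> int:
    """Carry read from a noisy continuous latent (toy assumption: floor readout)."""
    latent = raw_sum / 10 + noise
    return math.floor(latent)


def boundary_margin(raw_sum: int) -> float:
    """Distance from the noiseless latent to the nearest quantization boundary."""
    frac = (raw_sum % 10) / 10
    return min(frac, 1 - frac)


# Same noise, different outcome: 19 sits near a boundary, 15 does not.
for raw_sum in (15, 19):
    flipped = decoded_carry(raw_sum, noise=0.15) != true_carry(raw_sum)
    print(raw_sum, round(boundary_margin(raw_sum), 2), flipped)
```

Under this toy, the errors are not uniformly random: they concentrate on raw sums with a small margin, which is the sense in which such failures become predictable.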

05 [Eval] GENEB: Why Genomic Models Are Hard to Compare
Claims of superiority across genomic foundation models are largely artifacts of incompatible eval protocols, not real capability differences. GENEB evaluates frozen representations from 40 genomic foundation models across 100 tasks in 13 functional categories under a single unified probing protocol, including few-shot regimes, making controlled comparisons possible for the first time. Any team selecting a genomic foundation model from published leaderboards is likely choosing on noise rather than signal.
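A minimal sketch of what "one probe, one split, every model" means in practice. It uses a nearest-centroid classifier as a stand-in probe, since the summary does not say which probe GENEB uses. The function names and the input layout are assumptions for illustration.

```python
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def fit_centroids(xs: List[Vector], ys: List[Hashable]) -> Dict[Hashable, List[float]]:
    """Mean embedding per class, computed from frozen (precomputed) embeddings."""
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids: Dict[Hashable, List[float]], x: Vector) -> Hashable:
    """Label of the closest centroid under squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)))


def compare_models(
    embeddings_by_model: Dict[str, List[Vector]], labels: List[Hashable], n_train: int
) -> Dict[str, float]:
    """Score every model with the same probe and the same train/test split."""
    scores: Dict[str, float] = {}
    for name, embs in embeddings_by_model.items():
        centroids = fit_centroids(embs[:n_train], labels[:n_train])
        test_x, test_y = embs[n_train:], labels[n_train:]
        hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
        scores[name] = hits / len(test_y)
    return scores
```

Because the split, the probe, and the metric are fixed, any score difference comes from the representations themselves. Setting `n_train` to a small value gives a crude few-shot regime.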

06 [Inference] SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
Deep research agents reach the same answers with far fewer steps when redundant tool calls are penalized during training rather than filtered at inference. SlimSearcher uses adaptive reward gating to push agents toward shorter, sufficient trajectories without sacrificing accuracy, cutting trajectory length and token consumption in web research tasks. The efficiency gains transfer directly to serving time, reducing API and compute costs without any inference-time modification.
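One plausible reading of reward gating, sketched under stated assumptions: the length penalty applies only when the answer is correct and the policy's recent accuracy is above a threshold, so efficiency pressure never competes with learning to be right. The gate condition, the parameter names, and the default values are hypothetical and are not taken from SlimSearcher.

```python
def gated_reward(
    correct: bool,
    n_tool_calls: int,
    call_budget: int,
    recent_accuracy: float,
    gate_threshold: float = 0.6,    # hypothetical: accuracy needed to open the gate
    penalty_per_call: float = 0.05, # hypothetical: cost of each call over budget
) -> float:
    """Correctness reward, minus a length penalty that is only applied when gated on."""
    base = 1.0 if correct else 0.0
    gate_open = correct and recent_accuracy >= gate_threshold
    if not gate_open:
        # Wrong answers and early-training rollouts are scored on correctness alone.
        return base
    excess_calls = max(0, n_tool_calls - call_budget)
    return max(0.0, base - penalty_per_call * excess_calls)
```

With these defaults, a correct 12-call trajectory against a budget of 8 scores about 0.8 once the gate is open, and 1.0 while the policy is still below the accuracy threshold. Because the pressure is applied during training, nothing needs to change at serving time.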