Also Worth Noting - 2026-06-06

Audit gaps, poisoned skills, and structural ViT instability: five papers on hidden failure modes in AI systems

02 [Eval] When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
A model can pass every behavioral safety check while remaining fully compromised at the representation level. This paper formalizes that mismatch as the "audit gap": it constructs "dissociated models" that preserve safe outputs while retaining latent-space vulnerability, then measures robustness through soft interventions rather than output inspection alone. The finding undercuts a core assumption in red-teaming practice, namely that pass rates on behavioral evals are a reliable proxy for deployment safety. Teams shipping safety-critical LLMs should treat behavioral benchmarks as a floor, not a certificate.

03 [Agent] POISE: Position-Aware Undetectable Skill Injection on LLM Agents
Skill-poisoning attacks on LLM agents can execute a malicious payload while the user's legitimate task still passes its verifier in the same trial. POISE achieves this by being position-aware: the injected behavior fires at a point in the task trajectory where it does not disturb the task, so the failure signal that would invite inspection never appears. Prior attacks faced a reliability-stealth tradeoff; POISE targets both at once and counts a trial toward Attack Success Rate only when the payload executes and the user task passes. Monitoring for task errors is therefore not sufficient to catch skill-level compromise.
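The joint success criterion described in the POISE entry can be made concrete with a small sketch. The `Trial` record and both function names are hypothetical, introduced here for illustration only; they are not taken from the paper. The sketch counts a trial as an attack success only when the payload executed and the user task passed, and contrasts that with what a monitor that only watches for task failures would flag.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One agent run. Hypothetical record, not the paper's data format."""
    payload_executed: bool  # did the injected behavior run?
    task_passed: bool       # did the user's task pass its verifier?


def joint_attack_success_rate(trials: Sequence[Trial]) -> float:
    """Fraction of trials where the payload ran AND the user task passed."""
    if not trials:
        return 0.0
    hits = sum(1 for t in trials if t.payload_executed and t.task_passed)
    return hits / len(trials)


def flagged_by_task_error_monitor(trials: Sequence[Trial]) -> int:
    """Number of compromised trials a task-failure-only monitor would see."""
    return sum(1 for t in trials if t.payload_executed and not t.task_passed)


trials = [
    Trial(payload_executed=True, task_passed=True),    # stealthy success
    Trial(payload_executed=True, task_passed=False),   # noisy, gets noticed
    Trial(payload_executed=False, task_passed=True),   # attack did not fire
    Trial(payload_executed=True, task_passed=True),    # stealthy success
]
print(joint_attack_success_rate(trials))
print(flagged_by_task_error_monitor(trials))
```

In this toy data, three trials are compromised but only one produces a task failure, which is the gap the entry warns about.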

04 [Training] Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
Heuristic reflection loops in LLM agent harnesses overfit to raw success counts, treating lucky runs the same as reliable ones. Bayesian-Agent replaces that with posterior inference: skills and SOPs are treated as hypotheses about whether a frozen model will succeed under a specific prompt, context, and harness environment, and verified trajectories update a calibrated belief rather than a tally. Skill revisions become sensitive to uncertainty, not just frequency, which reduces the chance of promoting a skill that succeeded by coincidence. Teams running multi-step agent harnesses with reflection loops have a principled alternative to count-based logging here.
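To illustrate why a posterior behaves differently from a tally, here is a minimal sketch using a Beta-Bernoulli model. This is a stand-in chosen for simplicity, not the paper's actual inference procedure; the `SkillBelief` class, the uniform Beta(1, 1) prior, and the mean-minus-one-standard-deviation score are all assumptions of this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SkillBelief:
    """Beta(alpha, beta) belief over a skill's success probability.

    The uniform Beta(1, 1) prior is an assumption of this sketch.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, success: bool) -> None:
        # A verified trajectory adds one pseudo-count to the matching side.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        n = self.alpha + self.beta
        return (self.alpha * self.beta / (n * n * (n + 1.0))) ** 0.5

    def conservative_score(self) -> float:
        # Hypothetical promotion score: penalize uncertain beliefs.
        return self.mean - self.std


def belief_from_outcomes(outcomes: Iterable[bool]) -> SkillBelief:
    belief = SkillBelief()
    for outcome in outcomes:
        belief.update(outcome)
    return belief


lucky = belief_from_outcomes([True])                       # 1 of 1 succeeded
steady = belief_from_outcomes([True] * 18 + [False] * 2)   # 18 of 20 succeeded

for name, b in (("lucky", lucky), ("steady", steady)):
    print(name, round(b.mean, 3), round(b.conservative_score(), 3))
```

A raw success rate would rank the single lucky run (1/1) above the steady skill (18/20). Under the posterior, the lucky skill's belief stays wide, so the steady skill scores higher.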

05 [Inference] Chiaroscuro Attention: Spending Compute in the Dark
Most tokens in a long sequence carry low relational signal and do not need full self-attention, yet standard transformers apply it uniformly anyway. CHIAR-Former routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention), using per-token spectral entropy as a complexity signal. Ablations on WikiText-103 expose routing collapse: RBF is consistently rejected in favor of DCT and attention, suggesting spectral mixing covers most of the low-entropy workload. For long-context inference where the majority of tokens are static or repetitive, this routing approach could meaningfully cut FLOPs without touching model weights.
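The entropy-based routing idea can be sketched as follows. Everything specific here is an assumption rather than the paper's definition: spectral entropy is computed as the normalized Shannon entropy of a token vector's FFT power spectrum (an FFT is used instead of a DCT to keep the example NumPy-only), the thresholds `low` and `high` are arbitrary, and assigning the middle band to the RBF operator is a guess at how a three-way split might look.

```python
import numpy as np


def spectral_entropy(tokens: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Normalized entropy of each token's power spectrum, in [0, 1].

    tokens has shape (num_tokens, dim). This definition is an assumption
    of the sketch, not taken from the paper.
    """
    power = np.abs(np.fft.rfft(tokens, axis=-1)) ** 2
    p = power / (power.sum(axis=-1, keepdims=True) + eps)
    entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return entropy / np.log(p.shape[-1])


def route(tokens: np.ndarray, low: float = 0.3, high: float = 0.7) -> np.ndarray:
    """Pick an operator name per token from its spectral entropy.

    The thresholds and the band-to-operator mapping are hypothetical.
    """
    h = spectral_entropy(tokens)
    return np.where(h < low, "dct", np.where(h < high, "rbf", "attention"))


dim = 16
flat_token = np.ones(dim)        # energy in one frequency bin: low entropy
impulse_token = np.eye(dim)[0]   # flat spectrum: high entropy
tokens = np.stack([flat_token, impulse_token])

print(spectral_entropy(tokens))
print(route(tokens))
```

A constant vector concentrates its energy in a single bin and lands on the cheap operator, while an impulse spreads energy evenly across bins and is sent to full attention.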

06 [Theory] Phase Marginalization for Patch-Grid Instability in Vision Transformers
Shifting the patch partition in a Vision Transformer changes which tokens a boundary pixel sees, introducing phase-dependent instability that is structural, not a training artifact. Phase Marginalization treats patch-grid phase as a nuisance variable, evaluates multiple structured phase offsets, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical single-phase baseline without any fine-tuning. For teams running ViT-based dense prediction like segmentation or depth estimation, this is a post-hoc fix that costs only extra forward passes at inference time.
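The shift, predict, inverse-align, and average loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: shifts are circular (`np.roll`), whereas a real pipeline might pad or crop; the four offsets are an arbitrary choice for K = 4; and `patch_mean_model` is a toy stand-in for a ViT dense predictor whose output depends on where the patch grid falls.

```python
import numpy as np


def patch_mean_model(img: np.ndarray, patch: int = 4) -> np.ndarray:
    """Toy dense predictor: each pixel gets its patch's mean value.

    Stands in for a patch-based model; the output depends on grid phase.
    """
    h, w = img.shape
    blocks = img.reshape(h // patch, patch, w // patch, patch)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(means, blocks.shape).reshape(h, w)


def uniform_phase_marginalization(model, img, offsets):
    """Average dense predictions over patch-grid phases with equal weight."""
    aligned = []
    for dy, dx in offsets:
        # Shift the image so the fixed patch grid lands at a new phase.
        shifted = np.roll(img, shift=(dy, dx), axis=(0, 1))
        pred = model(shifted)
        # Undo the shift so predictions live in original image coordinates.
        aligned.append(np.roll(pred, shift=(-dy, -dx), axis=(0, 1)))
    return np.mean(aligned, axis=0)


img = np.arange(64, dtype=float).reshape(8, 8)
offsets = [(0, 0), (0, 2), (2, 0), (2, 2)]  # hypothetical K = 4 phases

single_phase = patch_mean_model(img)
marginalized = uniform_phase_marginalization(patch_mean_model, img, offsets)
print(single_phase[0, :4])
print(marginalized[0, :4])
```

The cost is one forward pass per offset. For a model that is already shift-equivariant, the procedure returns the single-phase output unchanged, so it only alters predictions where grid phase matters.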