Also Worth Noting - 2026-05-25

Safety monitoring for diffusion LLMs, agent lifespan decay, contamination detection, Shannon scaling, and FEA-driven CAD generation

02 [Inference] D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Diffusion LLMs expose intermediate hidden representations at every denoising step, states that monitors built for autoregressive models never see, and those states carry detectable safety signal before any final token is produced. D²-Monitor uses lightweight probes, routed by hesitation signals in the denoising trajectory, to catch harmful outputs earlier than final-token classifiers can. The framework is designed for always-on deployment, keeping compute overhead low enough to run continuously rather than as a spot check. Teams shipping diffusion LLMs into production should treat this as the first purpose-built safety monitoring baseline to evaluate against.
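
The routing idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that "hesitation" is measured as the entropy of a step's token distribution, that the probe is a logistic regression on the hidden state, and that the probe runs only on hesitant steps. The function names and both thresholds are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def linear_probe(hidden, weights, bias=0.0):
    """Logistic probe over a hidden-state vector; returns a score in (0, 1)."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def monitor(trajectory, weights, hesitation_threshold=0.5, harm_threshold=0.8):
    """Scan a denoising trajectory and return the first flagged step, or None.

    trajectory: list of (hidden_state, token_probs) pairs, one per step.
    Assumption: only steps whose entropy reaches hesitation_threshold are
    routed to the probe; confident steps are skipped to save compute.
    """
    for step, (hidden, probs) in enumerate(trajectory):
        if entropy(probs) < hesitation_threshold:
            continue  # confident step: do not spend a probe call on it
        if linear_probe(hidden, weights) >= harm_threshold:
            return step
    return None

# Toy trajectory: step 0 is confident, step 1 is hesitant and scores high.
toy = [([0.0, 0.0], [1.0, 0.0]), ([5.0, 5.0], [0.5, 0.5])]
flagged_step = monitor(toy, weights=[1.0, 1.0])
```

Skipping confident steps is what keeps the cost low in this sketch. A real monitor would need probe weights trained on labeled trajectories and thresholds tuned on held-out data.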

03 [Agent] Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Even with frozen model weights, deployed agents degrade over time as interaction history is compressed, memory stores grow, and facts are revised by later updates. Day-one benchmarks measure a snapshot of the base model, not the reliability of the full agent harness across weeks or months of operation. Lifespan is therefore a systems property, not a model property, and standard evals are systematically blind to it. Teams running persistent agents should track performance across deployment time, not just at initialization.
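
One way to start tracking performance across deployment time is to bucket task outcomes by agent age. The sketch below is an illustration only, not the paper's method. It assumes each logged task carries the number of days since deployment and a success flag; the function names and the weekly bucket size are hypothetical.

```python
from collections import defaultdict

def success_by_week(records):
    """Group task outcomes by deployment week.

    records: iterable of (days_since_deploy, succeeded) pairs.
    Returns {week_index: success_rate}, ordered by week.
    """
    totals = defaultdict(lambda: [0, 0])  # week -> [successes, attempts]
    for days, ok in records:
        week = days // 7
        totals[week][0] += int(ok)
        totals[week][1] += 1
    return {week: s / n for week, (s, n) in sorted(totals.items())}

def decay_from_baseline(rates):
    """Success-rate drop from the earliest observed week to the latest."""
    weeks = sorted(rates)
    return rates[weeks[0]] - rates[weeks[-1]]

# Toy log: the same agent, evaluated repeatedly as it ages.
log = [(0, True), (1, True), (8, True), (9, False), (15, False), (16, False)]
rates = success_by_week(log)
drop = decay_from_baseline(rates)
```

A day-one benchmark corresponds to reading only the first bucket. The later buckets are where this kind of degradation would show up.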

04 [Eval] The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
Models trained on paraphrased benchmark data expose themselves when chain-of-thought is truncated to zero steps: the memorized answer surfaces without the reasoning scaffold that would otherwise obscure it. Existing contamination detectors report these paraphrase-evaded models as clean, missing the signal that Zero-CoT truncation catches directly. The method targets the class of evasive contamination that malicious publishers use specifically to fool current detection pipelines. Any leaderboard evaluation that skips this check is likely accepting inflated reasoning scores at face value.
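
The check described above can be sketched as a comparison between accuracy with full chain-of-thought and accuracy with reasoning truncated to zero steps. This is a simplified illustration, not the paper's detector. The decision rule (flag a model that keeps most of its accuracy without reasoning) and the 0.9 retention threshold are assumptions, and all function names are hypothetical.

```python
def accuracy(answers, gold):
    """Fraction of answers that exactly match the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def retained_accuracy(cot_answers, zero_cot_answers, gold):
    """Share of chain-of-thought accuracy that survives Zero-CoT truncation."""
    cot_acc = accuracy(cot_answers, gold)
    if cot_acc == 0.0:
        return 0.0  # nothing to retain
    return accuracy(zero_cot_answers, gold) / cot_acc

def looks_contaminated(cot_answers, zero_cot_answers, gold, min_retained=0.9):
    """Assumed rule: flag models that barely need the reasoning steps."""
    return retained_accuracy(cot_answers, zero_cot_answers, gold) >= min_retained

gold = ["4", "7", "9", "12"]
with_cot = ["4", "7", "9", "12"]

# A model that still answers everything correctly with no reasoning is suspect.
suspect = looks_contaminated(with_cot, ["4", "7", "9", "12"], gold)
# A model whose accuracy collapses without reasoning is not flagged.
clean = looks_contaminated(with_cot, ["4", "0", "0", "0"], gold)
```

On easy benchmarks a clean model may also answer without reasoning, so a real threshold would have to be calibrated per task.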

05 [Theory] LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Standard power-law scaling laws treat catastrophic overtraining and quantization-induced degradation as anomalies, because a monotonic curve has no mechanism for performance to fall as compute rises. The Shannon Scaling Law maps model parameters to channel bandwidth and training tokens to signal power, borrowing directly from the Shannon-Hartley theorem, which makes non-monotonic degradation a predicted outcome rather than an unexplained outlier. The framework unifies overtraining and quantization failure under a single information-theoretic account. Practitioners sizing training runs near or past the compute-optimal point now have a theoretical prior for where the curve bends back down.
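
For reference, the Shannon-Hartley theorem in its standard form is shown below, where C is channel capacity in bits per second, B is bandwidth in hertz, and S/N is the signal-to-noise power ratio. Under the mapping described above, B corresponds to model parameters and S to training tokens. What plays the role of the noise term N, and the exact functional form the paper fits, are not stated here, so this is the textbook theorem rather than the paper's scaling law.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

Capacity here depends on the ratio S/N rather than on S alone. Any account in which effective noise grows faster than signal therefore predicts falling capacity, which is one plausible reading of how the framework accommodates non-monotonic behavior.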

06 [Application] Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback
Grading CAD outputs by geometric proximity to a gold reference produces models that look right but fail structural stress tests. This work closes that loop by using finite element analysis as the reward signal, so the agent iterates toward designs that pass real engineering criteria rather than visual similarity checks. The task requires producing a fully assembled multi-part STEP file, matching the industry-native workflow that prior synthesis-then-assembly pipelines split into disjoint steps. Teams building generative tools for mechanical engineering should treat FEA-in-the-loop as the evaluation standard that reference-proximity grading was always approximating.
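
The feedback loop can be illustrated with a toy stand-in. The sketch below does not run finite element analysis and does not produce STEP files. It assumes a closed-form bending-stress formula for a rectangular cantilever with a tip load in place of an FEA solver, and a trivial "thicken the beam" rule in place of the agent. All names and values are hypothetical.

```python
def max_bending_stress(force_n, length_m, width_m, height_m):
    """Peak bending stress (Pa) at the root of a rectangular cantilever
    carrying a tip load: sigma = 6 * F * L / (b * h^2)."""
    return 6.0 * force_n * length_m / (width_m * height_m ** 2)

def refine_design(force_n, length_m, width_m, height_m, allowable_pa,
                  step_m=0.001, max_iters=1000):
    """Revise the design until the structural check passes.

    Stand-in for the agent loop: the analysis result, not similarity to a
    reference shape, decides whether the design is accepted.
    Returns (final_height_m, final_stress_pa, revisions_made).
    """
    for revision in range(max_iters):
        stress = max_bending_stress(force_n, length_m, width_m, height_m)
        if stress <= allowable_pa:
            return height_m, stress, revision
        height_m += step_m  # hypothetical revision rule: thicken the beam
    raise RuntimeError("no passing design found within max_iters")

# 1 kN tip load on a 1 m beam, 50 mm wide, starting at 10 mm tall,
# with an assumed allowable stress of 250 MPa.
height, stress, revisions = refine_design(1000.0, 1.0, 0.05, 0.01, 250e6)
```

The initial 10 mm beam fails the check by a wide margin. A grader based only on geometric proximity to a reference could not detect that failure.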