Also Worth Noting - 2026-06-04
Five papers spanning robot video physics, memorization risk gaps, agent safety blind spots, diffusion compute waste, and a catalog of 63 budget-overrun failures
02 [Application] Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Video generation models that look physically plausible on screen often fail the moment their depicted motions are handed to a real robot arm. Dream.exe uses manipulation success rate as a ground-truth physics probe, testing whether generated motion sequences translate into executable behavior rather than just visually coherent frames. The framework exposes which architectures produce motions that are physically impossible despite looking correct, a distinction that pure visual quality metrics cannot catch. Teams evaluating video models for embodied or robotics applications now have a concrete benchmark beyond perceptual scores.
03 [Eval] LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
Models that can be forced to reproduce training data under adversarial prefix attacks rarely do so during ordinary use, which means current memorization risk scores conflate capability with actual behavior. PropMe introduces a propensity-aware framework that contrasts prefix-based extraction attacks against non-adversarial evaluations, then applies a metric transformation to existing memorization functions to produce propensity scores. The gap between what a model can be made to leak and what it spontaneously leaks turns out to be large. Privacy risk assessments built entirely on adversarial extraction therefore overstate real-world exposure.
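The capability-versus-propensity contrast can be illustrated with a toy calculation. The sketch below is not PropMe's metric transformation, which the summary does not specify. It assumes a simple n-gram overlap as the memorization function and uses made-up, pre-recorded model outputs. All names (`verbatim_overlap`, `mean_score`, the sample strings) are hypothetical.

```python
def verbatim_overlap(output, reference, n=3):
    """Toy memorization function: share of the reference's word n-grams found in the output."""
    ref_words = reference.split()
    out_words = output.split()
    ref_ngrams = [tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)]
    if not ref_ngrams:
        return 0.0
    out_ngrams = {tuple(out_words[i:i + n]) for i in range(len(out_words) - n + 1)}
    return sum(g in out_ngrams for g in ref_ngrams) / len(ref_ngrams)


def mean_score(outputs, reference, n=3):
    """Average the memorization function over a set of recorded outputs."""
    return sum(verbatim_overlap(o, reference, n) for o in outputs) / len(outputs)


# Assumed training snippet and recorded outputs (illustrative data only).
reference = "the quick brown fox jumps over the lazy dog"
adversarial_outputs = ["the quick brown fox jumps over the lazy dog"]  # prefix attack
benign_outputs = ["foxes are small wild mammals"]  # ordinary prompt

capability = mean_score(adversarial_outputs, reference)  # what the model can be made to leak
propensity = mean_score(benign_outputs, reference)       # what it leaks unprompted
gap = capability - propensity
```

A large `gap` is the situation the entry describes: the adversarial score alone would overstate how often the snippet surfaces in ordinary use.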
04 [Agent] BraveGuard: From Open-World Threats to Safer Computer-Use Agents
Single-step safety filters miss a whole class of computer-use agent attacks in which every individual action looks benign but the execution sequence causes harm. BraveGuard builds a self-evolving defense framework that mines open-world threat signals and realistic multi-step agent trajectories to train guard models against these trace-level attacks. The framework continuously updates from emerging research sources, so its threat coverage grows as attack patterns evolve. Teams deploying agents over files, terminals, or browsers should treat trace-level evaluation as a separate safety layer from prompt-level filtering.
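A minimal sketch of why the two layers differ, assuming hand-written rules rather than BraveGuard's trained guard models. The `Action` type, the action kinds, and both rules are hypothetical and chosen only to show a trace that passes every per-step check yet fails a sequence-level one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # hypothetical action type, e.g. "read_file", "http_post"
    target: str  # file path or URL the action touches


def step_is_safe(action):
    """Single-step filter: judges one action with no memory of earlier ones."""
    return action.kind != "delete_file"  # toy rule, assumed for illustration


def trace_is_safe(trace):
    """Trace-level check: flags a sensitive read followed later by an outbound send."""
    read_sensitive = False
    for action in trace:
        if action.kind == "read_file" and action.target.endswith(".env"):
            read_sensitive = True
        elif action.kind == "http_post" and read_sensitive:
            return False
    return True


trace = [
    Action("read_file", "project/.env"),
    Action("write_file", "/tmp/notes.txt"),
    Action("http_post", "https://example.com/upload"),
]

every_step_passes = all(step_is_safe(a) for a in trace)  # True
whole_trace_passes = trace_is_safe(trace)                # False
```

Each action clears the per-step filter, but the read-then-send ordering is only visible to a check that keeps state across the whole trace.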
05 [Inference] Complexity-Balanced Diffusion Splitting
Spending equal compute on every diffusion timestep wastes roughly as many FLOPs as scaling papers spend to recover the lost quality, because early noisy steps and late refinement steps differ sharply in signal complexity. Complexity-Balanced Splitting (CBS) distributes the generative workload across multiple specialized sub-networks, each assigned to a band of timesteps calibrated by function approximation theory and de Boor's equidistribution principle. Harder timesteps get more capacity; easier ones get less. Practitioners running diffusion inference at scale can treat CBS as a principled alternative to monolithic architecture scaling.
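The equidistribution idea can be sketched on its own: place band boundaries so that each band carries roughly the same share of a cumulative complexity measure. This is a generic illustration, not the paper's procedure. The per-timestep `complexity` scores and the function name `equidistribute` are assumptions, and how CBS actually estimates complexity is not stated in the summary.

```python
from bisect import bisect_left
from itertools import accumulate


def equidistribute(complexity, num_bands):
    """Split timestep indices into contiguous bands of near-equal total complexity.

    `complexity[t]` is a hypothetical positive difficulty score for timestep t.
    Returns a list of (start, end) index pairs, end exclusive.
    """
    total_steps = len(complexity)
    if num_bands < 1 or num_bands > total_steps:
        raise ValueError("need 1 <= num_bands <= number of timesteps")
    if any(c <= 0 for c in complexity):
        raise ValueError("complexity scores must be positive")

    cumulative = list(accumulate(complexity))  # running total of the scores
    total = cumulative[-1]

    boundaries = [0]
    for k in range(1, num_bands):
        target = total * k / num_bands
        # cut just after the first timestep whose running total reaches the k-th equal share
        cut = bisect_left(cumulative, target) + 1
        # keep every band non-empty and leave room for the remaining bands
        cut = max(cut, boundaries[-1] + 1)
        cut = min(cut, total_steps - (num_bands - k))
        boundaries.append(cut)
    boundaries.append(total_steps)
    return list(zip(boundaries[:-1], boundaries[1:]))


bands = equidistribute([4, 2, 1, 1], 2)  # [(0, 1), (1, 4)]
```

In this toy case the single hard timestep gets a band to itself while the three easy ones share the other. If each band is served by a sub-network of similar size (an assumption), that is one way harder timesteps end up with more capacity per step.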
06 [Open-source] Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study
Retry loops and delegation chains are the two dominant failure modes across 63 confirmed production budget-overrun incidents drawn from 21 orchestration frameworks between 2023 and 2026. Each incident in the catalog is backed by a quoted GitHub issue and, where reported, a dollar loss, and the incidents are organized into eight failure clusters. The paper also proposes what it presents as the first type-system-level mitigation: affine-typed Rust enforcement that makes cost-bearing values non-aliasable and non-reusable by construction. Engineers building agent runtimes now have a failure taxonomy and a concrete language-level approach for cost containment.
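The "non-reusable by construction" idea can be approximated at runtime. The sketch below is a Python analogue, not the paper's Rust design. Python has no affine types, so reuse is caught by an exception when the code runs rather than rejected by a compiler, and aliasing is not prevented at all. The names `Budget`, `BudgetError`, and `call_with_retries` are hypothetical.

```python
class BudgetError(RuntimeError):
    """Raised when a handle is reused or the allowance is exhausted."""


class Budget:
    """Single-use handle on a token allowance (hypothetical name)."""

    def __init__(self, tokens):
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self._tokens = tokens
        self._consumed = False

    @property
    def remaining(self):
        return self._tokens

    def spend(self, cost):
        """Consume this handle and return a fresh one holding the remainder."""
        if self._consumed:
            raise BudgetError("handle already consumed")
        if cost > self._tokens:
            raise BudgetError(f"cost {cost} exceeds remaining {self._tokens}")
        self._consumed = True
        return Budget(self._tokens - cost)


def call_with_retries(budget, cost_per_call, max_attempts, attempt):
    """Retry loop in which every attempt must pay from the current handle."""
    for _ in range(max_attempts):
        budget = budget.spend(cost_per_call)  # raises once the allowance runs out
        if attempt():
            break
    return budget


leftover = call_with_retries(Budget(100), 20, 5, lambda: True)
```

Because each retry has to consume the current handle, a loop that keeps failing stops with a `BudgetError` instead of spending without bound, which is the retry-loop failure mode the catalog highlights.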