Also Worth Noting - 2026-05-21
KV cache pressure attacked from four angles, plus a privacy-utility fix for personal-assistant agents
02 [Inference] WorldKV: Efficient World Memory with World Retrieval and Compression
Full KV-cache attention keeps video worlds consistent across revisits but scales memory linearly with rollout length, killing real-time throughput. WorldKV sidesteps that wall without retraining: a retrieval component surfaces relevant past context on demand, while a compression component prunes what no longer needs full resolution. The result is persistent-world consistency at interactive frame rates. Anyone building action-conditioned world models now has a training-free path that doesn't force a choice between memory and speed.
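To make the linear-growth point concrete, here is a back-of-the-envelope sketch of how full KV-cache size scales with rollout length. The model shape and tokens-per-frame figures are hypothetical, chosen only for illustration; they are not taken from the WorldKV paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for num_tokens cached tokens."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model shape and tokens per frame (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM, TOKENS_PER_FRAME = 24, 16, 64, 256

for frames in (100, 200, 400):
    size = kv_cache_bytes(frames * TOKENS_PER_FRAME, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{frames} frames: {size / 2**20:.0f} MiB")
```

Doubling the number of frames doubles the cache, which is why an uncompressed cache eventually caps rollout length or frame rate on fixed hardware.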
03 [Inference] OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Standard per-channel KV quantization degrades sharply at extreme compression, and the root cause is Token Norm Imbalance rather than channel-wise outliers alone. OScaR reframes the problem at the tensor level, treating norm imbalance as the primary bottleneck and correcting it before quantization runs. The approach opens a concrete path to sub-2-bit KV cache without accuracy collapse. For teams running long-context or multimodal workloads at scale, this reframing changes where to look when quantization quality falls apart.
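A toy sketch of why token norm imbalance hurts per-channel quantization: one high-norm token sets every channel's scale, so ordinary tokens round to almost nothing. The correction shown here (store each token's norm separately and quantize unit-norm tokens) is a generic illustration under our own assumptions, not OScaR's actual algorithm. It also uses 3 bits and synthetic Gaussian data rather than the sub-2-bit regime the paper targets.

```python
import math
import random


def quantize_per_channel(rows, bits):
    """Symmetric uniform quantize-dequantize with one scale per channel (column)."""
    levels = 2 ** (bits - 1) - 1
    num_channels = len(rows[0])
    # The scale of each channel is set by its largest-magnitude entry.
    scales = [max(abs(r[c]) for r in rows) / levels or 1.0 for c in range(num_channels)]
    return [[round(r[c] / scales[c]) * scales[c] for c in range(num_channels)] for r in rows]


def normalize_then_quantize(rows, bits):
    """Illustrative fix: factor out per-token norms before per-channel quantization."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    unit = [[v / n for v in r] for r, n in zip(rows, norms)]
    restored = quantize_per_channel(unit, bits)
    # Norms are kept in full precision and multiplied back in.
    return [[v * n for v in r] for r, n in zip(restored, norms)]


def mean_relative_error(original, restored):
    """Average over tokens of ||x - x_hat|| / ||x||."""
    errors = []
    for x, y in zip(original, restored):
        num = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        den = math.sqrt(sum(a * a for a in x)) or 1.0
        errors.append(num / den)
    return sum(errors) / len(errors)


random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(16)] for _ in range(64)]  # 64 tokens, 16 channels
keys[0] = [50.0 * v for v in keys[0]]  # one token with a much larger norm

plain = mean_relative_error(keys, quantize_per_channel(keys, bits=3))
fixed = mean_relative_error(keys, normalize_then_quantize(keys, bits=3))
print(f"per-channel only: {plain:.2f}, with token-norm correction: {fixed:.2f}")
```

The trade-off in this sketch is one extra full-precision scalar per token, which is small next to the quantized payload.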
04 [Theory] Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Delta-rule linear attention uses a single scalar gate to control both how much old content to erase and how much new content to write, and that conflation is the core failure mode on associative recall tasks. Gated DeltaNet-2 separates the two operations, giving the model independent control over the key-side erase and the value-side write. Decoupling alone produces measurable gains on the exact benchmarks where linear attention has historically fallen short of softmax. That failure mode has been the main argument against deploying linear transformers in production, making this a result worth tracking.
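A minimal sketch of the idea, assuming the standard delta-rule update S ← S(I − β k kᵀ) + β v kᵀ and splitting β into two scalars, `erase` and `write`. The names and the scalar parametrization are our assumptions for illustration; the paper's actual gates may take a different form.

```python
def delta_step(S, k, v, erase, write):
    """One update S <- S (I - erase * k k^T) + write * v k^T.

    S is a d_v x d_k matrix (list of rows), k has length d_k, v has length d_v.
    Setting erase == write recovers the single-gate delta rule.
    """
    # What the memory currently returns for key k, i.e. S @ k.
    recalled = [sum(s * kj for s, kj in zip(row, k)) for row in S]
    return [
        [s - erase * r * kj + write * vi * kj for s, kj in zip(row, k)]
        for row, r, vi in zip(S, recalled, v)
    ]


def read(S, q):
    """Query the memory: S @ q."""
    return [sum(s * qj for s, qj in zip(row, q)) for row in S]


key = [1.0, 0.0]
old_value, new_value = [1.0, 0.0], [0.0, 1.0]

empty = [[0.0, 0.0], [0.0, 0.0]]
memory = delta_step(empty, key, old_value, erase=1.0, write=1.0)  # store old_value under key

coupled = delta_step(memory, key, new_value, erase=0.5, write=0.5)
decoupled = delta_step(memory, key, new_value, erase=1.0, write=0.5)

print(read(coupled, key))    # [0.5, 0.5]: half of the old value is still there
print(read(decoupled, key))  # [0.0, 0.5]: old value fully erased, new value at half strength
```

With a single gate, a cautious write necessarily leaves stale content behind; with two gates, the memory can clear a slot completely while still writing conservatively.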
05 [Training] The Distillation Game: Adaptive Attacks & Efficient Defenses
Treating model distillation as a minimax game between a utility-constrained teacher and an adaptive student yields a specific, tractable defense: suppress the outputs that carry the highest signal value for imitation, not outputs at random. The teacher-side defense, called Product-of-Experts, requires only a forward pass and a cheap proxy for example value. Critically, it reduces distillation signal without degrading general utility, resolving the trade-off that has made API-level defenses hard to deploy. Model API operators now have a decision framework with concrete response rules rather than heuristics.
06 [Application] It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
Frontier models make systematic errors on context-appropriate disclosure decisions, not just factual ones, and existing mitigations recover privacy only by degrading task performance. SELFCI decouples information suppression from task execution using complementary self-distillation, training two specialized branches from the base model without any external labeled privacy data. The framework closes the privacy-utility gap that has made deploying personal-assistant agents on sensitive workflows a liability. For teams building agents that handle medical, legal, or financial workflows, this is a direct path to contextual-integrity (CI) compliance that doesn't require curating a privacy dataset from scratch.