Also Worth Noting - 2026-05-09

Full-duplex interaction, few-step diffusion, agentic RL credit assignment, memory reranking, and MARL eval gaps


02 [Inference] MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
Turn-taking, rather than latency or modality coverage alone, is the hidden bottleneck in multimodal AI interaction. MiniCPM-o 4.5 removes the alternating perceive-then-respond cycle by processing audio, video, and text simultaneously in a continuous streaming loop, bringing reported interaction latency below 200 ms. The model also moves from purely reactive responses to proactive behavior, acting without waiting for an explicit user prompt. Teams building real-time voice or video assistants should treat this architecture as a reference point for what a full-duplex omni-modal pipeline can look like at small model scale.
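To make the "continuous streaming loop" idea concrete, here is a minimal concurrency sketch in which perception and response run as two cooperating tasks instead of alternating turns. It is a generic asyncio illustration, not MiniCPM-o's architecture; `perceive`, `respond`, and the `ack:` outputs are hypothetical names invented for the example.

```python
import asyncio

async def perceive(chunks, inbox, log):
    """Hypothetical input side: keeps feeding chunks while the responder runs."""
    for chunk in chunks:
        log.append(("in", chunk))
        await inbox.put(chunk)
        await asyncio.sleep(0)  # yield control so the responder can act between chunks
    await inbox.put(None)       # sentinel: the input stream has ended

async def respond(inbox, log):
    """Hypothetical output side: reacts to each chunk as soon as it arrives."""
    while True:
        chunk = await inbox.get()
        if chunk is None:
            break
        log.append(("out", f"ack:{chunk}"))

async def main():
    inbox = asyncio.Queue()
    log = []
    # Both sides run concurrently; neither waits for the other to finish a "turn".
    await asyncio.gather(perceive(["a", "b", "c"], inbox, log), respond(inbox, log))
    return log

log = asyncio.run(main())
```

In a turn-based design the first output would appear only after the last input chunk. Here the log shows outputs interleaved with inputs, which is the property a full-duplex pipeline is after.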

03 [Training] Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
Vanilla Distribution Matching Distillation (DMD) supervises the student at only a handful of discrete timesteps, and that sparse supervision is what produces its visual artifacts and over-smoothed outputs. Replacing the fixed timesteps with a continuous-time distribution matching objective covers the full trajectory without imposing ODE path constraints, closing the quality gap to consistency methods in 1-4 step generation. The mode-seeking behavior of the reverse KL objective, which plagued discrete DMD, is substantially reduced. Teams distilling image generation models for low-step inference get a cleaner theoretical path here than patching discrete-timestep pipelines with auxiliary losses.
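The coverage difference between the two supervision schemes can be illustrated with a toy timestep sampler. This sketch only contrasts where supervision lands. It does not implement the paper's objective, and the four-point grid and the uniform distribution over [0, 1] are assumptions chosen for illustration.

```python
import random

def sample_discrete_timestep(rng, grid=(0.25, 0.5, 0.75, 1.0)):
    """Assumed fixed grid: supervision only ever lands on these few timesteps."""
    return rng.choice(grid)

def sample_continuous_timestep(rng, t_min=0.0, t_max=1.0):
    """Assumed uniform sampling: supervision can land anywhere on the trajectory."""
    return rng.uniform(t_min, t_max)

rng = random.Random(0)
discrete = {sample_discrete_timestep(rng) for _ in range(1000)}
continuous = {sample_continuous_timestep(rng) for _ in range(1000)}
```

After a thousand draws the discrete sampler has still visited at most four timesteps, while the continuous sampler has visited many distinct points on the trajectory.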

04 [Agent] StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
Purely reactive RL fine-tuning plateaus on tasks longer than roughly 10 steps, because assigning credit across a long trajectory is nearly impossible without a higher-level plan. StraTA addresses this by sampling a compact strategy representation from the initial task state, conditioning every subsequent action on that strategy, and training the strategy generator jointly with the action policy. The explicit trajectory-level abstraction gives the reward signal a shorter path back to the decisions that mattered. For teams running agentic RL on multi-step tool-use or planning benchmarks, adding a strategy scaffold before policy optimization is a low-cost intervention with measurable gains.
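The rollout structure described above can be sketched as follows: the strategy is sampled once, every action is conditioned on it, and the trajectory return is attached to both the strategy and the actions. This is a toy illustration under assumed interfaces, not StraTA's implementation; every function name and the integer "environment" are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    strategy: str
    steps: list = field(default_factory=list)  # (observation, action) pairs
    ret: float = 0.0                           # trajectory-level return

def rollout(task, strategy_policy, action_policy, env_step, max_steps=20):
    strategy = strategy_policy(task)           # sampled once, from the initial task state
    traj = Trajectory(strategy=strategy)
    obs = task["start"]
    for _ in range(max_steps):
        action = action_policy(strategy, obs)  # every action is conditioned on the strategy
        traj.steps.append((obs, action))
        obs, reward, done = env_step(task, obs, action)
        traj.ret += reward
        if done:
            break
    return traj

def training_examples(task, traj):
    """One example for the strategy plus one per action, all sharing the same return."""
    examples = [("strategy", task, traj.strategy, traj.ret)]
    for obs, action in traj.steps:
        examples.append(("action", (traj.strategy, obs), action, traj.ret))
    return examples

# Toy stand-ins (assumptions): walk an integer position toward a goal.
def toy_strategy_policy(task):
    return "increase" if task["goal"] > task["start"] else "decrease"

def toy_action_policy(strategy, obs):
    return 1 if strategy == "increase" else -1

def toy_env_step(task, obs, action):
    new_obs = obs + action
    done = new_obs == task["goal"]
    return new_obs, (1.0 if done else 0.0), done

task = {"start": 0, "goal": 3}
traj = rollout(task, toy_strategy_policy, toy_action_policy, toy_env_step)
examples = training_examples(task, traj)
```

The strategy example receives the trajectory return directly, one hop from the outcome, which is the "shorter path" for credit that the summary refers to. A real setup would replace the toy policies with model calls and feed the examples to a policy-gradient update.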

05 [RAG] MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
Semantic-similarity reranking surfaces memories that look relevant but lack the specific facts needed to answer reasoning-dependent queries. The resulting scores are miscalibrated, which makes threshold-based filtering unreliable in production agent memory systems. MemReranker replaces similarity scoring with a reasoning-aware model that evaluates whether a retrieved memory actually supports answering the question, not just whether it shares surface vocabulary. Top-1 recall on reasoning-dependent queries improves by over 15 points against BM25 and dense baselines on the authors' benchmark. Teams using retrieve-then-rerank pipelines in long-term agent memory should audit how much of their retrieval failure is a reranker reasoning gap rather than a retrieval coverage gap.
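One way to run that audit is to split top-1 failures by whether the supporting memory was in the candidate set at all. The sketch below is a minimal illustration, assuming each query has a labeled gold memory id; `classify_failure`, `audit`, and the record layout are hypothetical and not part of MemReranker.

```python
from collections import Counter

def classify_failure(gold_id, retrieved_ids, reranked_ids):
    """Label one query's outcome in a retrieve-then-rerank pipeline.

    gold_id: id of the memory that actually supports the answer (assumed labeled).
    retrieved_ids: candidate ids returned by the first-stage retriever.
    reranked_ids: the same candidates in reranker order, best first.
    """
    if gold_id not in retrieved_ids:
        return "retrieval_coverage_gap"  # the reranker never saw the right memory
    if reranked_ids and reranked_ids[0] == gold_id:
        return "hit"
    return "reranker_gap"                # right memory was available but not ranked first

def audit(records):
    """Count outcomes over an iterable of (gold_id, retrieved_ids, reranked_ids)."""
    return Counter(classify_failure(*record) for record in records)

# Made-up records for illustration only.
records = [
    ("m1", ["m1", "m2", "m3"], ["m1", "m3", "m2"]),  # ranked first
    ("m4", ["m5", "m4", "m6"], ["m5", "m4", "m6"]),  # retrieved, but ranked second
    ("m7", ["m8", "m9"], ["m9", "m8"]),              # never retrieved
]
summary = audit(records)
```

A high `reranker_gap` count points at the reranker, where a reasoning-aware model could help. A high `retrieval_coverage_gap` count means the first-stage retriever needs attention first.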

06 [Eval] Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
Two cooperative MARL systems can post identical win rates while their coordination quality scores differ by 40%, which means aggregate return metrics measure outcomes, not whether agents actually coordinate. Standard benchmarks collapse all joint assignment behavior into a single scalar, hiding whether performance came from genuine coordination or from individually greedy policies that happened to avoid collisions. The STAT testbed introduced here supplements return with process-level diagnostics, varying agent count, task count, and commitment constraints in a controlled way. Teams publishing or comparing cooperative MARL results should run coordination-aware diagnostics before treating leaderboard rankings as evidence of learned joint behavior.
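A toy example shows how an outcome metric and a process metric can disagree. The conflict-free rate below is a hypothetical stand-in for a coordination diagnostic, not one of STAT's actual metrics, and the episode data is invented for illustration.

```python
def win_rate(episodes):
    """Outcome metric: fraction of episodes won."""
    return sum(ep["won"] for ep in episodes) / len(episodes)

def conflict_free_rate(episodes):
    """Process metric (assumed): fraction of joint assignments with no duplicated task."""
    total = 0
    clean = 0
    for ep in episodes:
        for joint in ep["assignments"]:    # one tuple of task ids per step, one id per agent
            total += 1
            if len(set(joint)) == len(joint):
                clean += 1
    return clean / total if total else 0.0

# Two agents, two tasks; both systems win one episode out of two.
system_a = [
    {"won": True,  "assignments": [(0, 1), (1, 0)]},
    {"won": False, "assignments": [(0, 1), (0, 1)]},
]
system_b = [
    {"won": True,  "assignments": [(0, 0), (1, 0)]},  # agents collide on task 0
    {"won": False, "assignments": [(1, 1), (0, 1)]},  # agents collide on task 1
]

wins = (win_rate(system_a), win_rate(system_b))
coordination = (conflict_free_rate(system_a), conflict_free_rate(system_b))
```

Both systems have the same win rate, yet system B's agents pick the same task in half of their joint assignments. A leaderboard that reports only the first number would rank them as equal.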