Also Worth Noting - 2026-05-16

Agentic memory, eval gaps, and open infrastructure: five papers tightening the loop between benchmarks and deployment reality


02 [Agent] HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
Flat vector stores treat every memory edge as equally valid at retrieval time. HAGE replaces that assumption by organizing memory as relation-specific graph views over shared nodes, then using RL to continuously reweight edges based on query-conditioned confidence rather than fixing weights at write time. Retrieval becomes a sequential traversal problem, not a lookup problem. Teams running production agentic pipelines where relationship strength shifts across sessions should treat this as a concrete architectural alternative to static vector memory.
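To make the "traversal, not lookup" idea concrete, here is a minimal sketch of a memory graph with one weighted edge set per relation type over shared nodes. It is not HAGE's implementation: the class name `GraphMemory`, the bandit-style `reweight` update (a stand-in for the paper's RL step), and the per-query `relation_scores` dictionary are all assumptions made for illustration.

```python
from collections import defaultdict
import heapq


class GraphMemory:
    """Hypothetical layout: shared nodes, one weighted edge set per relation."""

    def __init__(self):
        # edges[relation][src] -> {dst: weight}
        self.edges = defaultdict(lambda: defaultdict(dict))

    def add_edge(self, relation, src, dst, weight=0.5):
        self.edges[relation][src][dst] = weight

    def reweight(self, relation, src, dst, reward, lr=0.2):
        # Assumption: a simple move-toward-reward update stands in for RL.
        w = self.edges[relation][src][dst]
        self.edges[relation][src][dst] = w + lr * (reward - w)

    def traverse(self, start, relation_scores, max_hops=2, top_k=3):
        """Best-first walk; a path's score is the product over its edges of
        edge weight times the query-conditioned relevance of the relation."""
        best = {start: 1.0}
        frontier = [(-1.0, 0, start)]  # (negated score, hops, node)
        while frontier:
            neg_score, hops, node = heapq.heappop(frontier)
            score = -neg_score
            if score < best.get(node, 0.0) or hops == max_hops:
                continue  # stale entry, or hop budget used up
            for relation, relevance in relation_scores.items():
                for dst, w in self.edges[relation].get(node, {}).items():
                    s = score * w * relevance
                    if s > best.get(dst, 0.0):
                        best[dst] = s
                        heapq.heappush(frontier, (-s, hops + 1, dst))
        ranked = sorted(
            ((n, s) for n, s in best.items() if n != start),
            key=lambda item: -item[1],
        )
        return ranked[:top_k]


memory = GraphMemory()
memory.add_edge("temporal", "meeting", "followup")
memory.add_edge("causal", "meeting", "decision")
memory.reweight("causal", "meeting", "decision", reward=1.0)
results = memory.traverse("meeting", {"causal": 0.9, "temporal": 0.3})
```

The same pair of nodes can rank differently for two queries because the relation relevances are supplied per query, while the edge weights keep drifting with feedback instead of being frozen at write time.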

03 [Training] Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Performance gains from scaling chain-of-thought in-context examples plateau earlier on reasoning tasks than on non-reasoning tasks, which sets a practical ceiling that existing many-shot scaling intuitions miss. Standard many-shot rules derived from non-reasoning benchmarks do not transfer once chains of thought are involved, whether or not the underlying model is reasoning-oriented. That plateau means prompt-only adaptation has a lower ceiling for reasoning work than the long-context scaling narrative implies. Teams deciding between many-shot prompting and fine-tuning for reasoning tasks now have a cleaner signal for where the crossover point sits.
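One way to locate that plateau on your own task is to sweep the shot count and find where marginal gains stop. The sketch below is an assumption about how a team might do this, not a procedure from the paper; `plateau_point`, the `min_gain` threshold, and the accuracy numbers are all made-up placeholders.

```python
def plateau_point(curve, min_gain=0.01):
    """Return the smallest shot count after which every later step
    improves accuracy by less than min_gain.

    curve: iterable of (num_shots, accuracy) pairs.
    If the final step still gains at least min_gain, the largest shot
    count is returned, meaning no plateau was observed in the sweep.
    """
    points = sorted(curve)
    plateau = points[-1][0]
    # Walk backward from the largest shot count while gains stay small.
    for i in range(len(points) - 1, 0, -1):
        gain = points[i][1] - points[i - 1][1]
        if gain >= min_gain:
            break
        plateau = points[i - 1][0]
    return plateau


# Placeholder sweep results, not measurements from the paper.
sweep = [(4, 0.52), (16, 0.61), (64, 0.66), (256, 0.665), (1024, 0.667)]
knee = plateau_point(sweep)
```

If the returned shot count sits well below your context budget, adding more examples is unlikely to help, and that is the point at which fine-tuning becomes worth comparing.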

04 [Eval] WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Agents that score well on synthetic sandboxes fail measurably on WildClawBench's 60 human-authored, bilingual, multimodal CLI tasks, each averaging roughly 8 minutes of wall-clock time and more than 20 tool calls. The gap exposes a structural mismatch: existing benchmarks use mock APIs and final-answer checks, which do not capture the failure modes that accumulate over long-horizon execution in a real runtime. Current eval suites systematically overestimate deployment readiness for command-line agents. Any team shipping CLI-facing agents should run WildClawBench before treating existing benchmark scores as a proxy for production behavior.

05 [Open-source] Orchard: An Open-Source Agentic Modeling Framework
Most open-source agent frameworks stop at orchestration and evaluation, leaving teams to wire in proprietary training infrastructure to reproduce state-of-the-art agentic RL results. Orchard closes that gap by integrating both the training pipeline and the agentic inference loop in a single open codebase, covering planning, tool use, and multi-turn RL training through a lightweight environment server called Orchard Env. It is one of the first open frameworks where the full loop, from environment interaction through policy update, runs without a proprietary dependency. Teams that have been blocked from reproducing agentic RL work by infrastructure constraints now have a reference stack to build from.

06 [Eval] EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Models optimized on single-axis voice evals, either conversation quality or task completion, fall short by a measurable margin when both axes are scored together. EVA-Bench addresses this by orchestrating bot-to-bot audio conversations over dynamic multi-turn dialogues with automatic simulation validation, then measuring quality across the full scope of voice-specific failure modes in one pass. No existing benchmark jointly covers both challenges. Teams deploying enterprise voice agents should treat single-axis eval scores as incomplete and run EVA-Bench's combined metric before drawing conclusions about production readiness.
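The reason single-axis scores mislead can be shown with simple arithmetic: a joint pass rate can never exceed either single-axis rate, and it is often well below both. The sketch below is illustrative only. It assumes a conversation passes jointly when the task completes and a quality score clears a threshold, which is not necessarily how EVA-Bench defines its combined metric; `Episode`, `pass_rates`, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task_completed: bool
    quality_score: float  # hypothetical 0..1 conversation-quality score


def pass_rates(episodes, quality_threshold=0.7):
    """Compare single-axis pass rates with the joint pass rate."""
    n = len(episodes)
    task = sum(e.task_completed for e in episodes) / n
    quality = sum(e.quality_score >= quality_threshold for e in episodes) / n
    joint = sum(
        e.task_completed and e.quality_score >= quality_threshold
        for e in episodes
    ) / n
    return {"task": task, "quality": quality, "joint": joint}


# Placeholder episodes, not EVA-Bench data.
episodes = [
    Episode(True, 0.9),
    Episode(True, 0.4),   # finished the task, poor conversation
    Episode(False, 0.8),  # pleasant conversation, task failed
    Episode(True, 0.8),
]
rates = pass_rates(episodes)
```

In this toy set each axis passes 75% of episodes on its own, yet only half pass both, which is the kind of gap a single-axis eval hides.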