Also Worth Noting - 2026-06-07

Consensus illusions, scaffold artifacts, and compute waste: five papers exposing hidden inefficiencies across agent, retrieval, and training pipelines

02 [Agent] The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment
Answer-level consensus in multi-agent medical QA is not a reliability signal. The CARA metrics expose a specific failure mode: agents that agree on a final answer can be reasoning from entirely different chains, meaning the debate mechanism suppresses visible disagreement without resolving underlying misalignment. Tested on MedQA-USMLE and MedThink-Bench, CARA quantifies the gap between answer agreement and reasoning agreement. Teams treating consensus as a confidence proxy in high-stakes pipelines are measuring the wrong thing.
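
To make the gap between answer agreement and reasoning agreement concrete, here is a minimal sketch. It is not the CARA metric itself: the Jaccard overlap over step labels, the function names, and the example data are all assumptions chosen for illustration.

```python
from itertools import combinations


def answer_agreement(answers):
    """Fraction of agent pairs that give the same final answer."""
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def reasoning_agreement(chains):
    """Mean pairwise Jaccard overlap between agents' sets of reasoning steps."""
    pairs = list(combinations(chains, 2))
    total = 0.0
    for a, b in pairs:
        sa, sb = set(a), set(b)
        union = sa | sb
        total += len(sa & sb) / len(union) if union else 1.0
    return total / len(pairs)


def consistency_gap(answers, chains):
    """Answer agreement minus reasoning agreement (larger = more hidden disagreement)."""
    return answer_agreement(answers) - reasoning_agreement(chains)


# Hypothetical debate output: three agents, same answer, different step labels.
answers = ["B", "B", "B"]
chains = [
    ["fever", "murmur", "endocarditis"],
    ["fever", "rash", "endocarditis"],
    ["iv_drug_use", "endocarditis"],
]
gap = consistency_gap(answers, chains)
```

In this toy case the agents agree perfectly on the answer while sharing only about a third of their reasoning steps, which is the pattern a consensus-only check would miss.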

03 [Eval] Scaffold Effects on GAIA: A Controlled Comparison
Published agent benchmark scores are partly a scaffold engineering artifact, not a pure model capability reading. This controlled comparison holds tasks and conditions fixed across three scaffolds (ReAct, Planner-Actor-Rater, and planner-then-executor) and five models from three providers, running three attempts per question on GAIA Levels 1 and 2. Scaffold choice alone moves scores by a measurable margin independent of which model is underneath. Anyone reading agent leaderboards should treat the scaffold as a confound on the same order as the model itself.
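
One simple way to check whether scaffold choice matters as much as model choice in your own evaluations is to compare the spread of marginal means along each factor. The sketch below assumes a dictionary of accuracy scores keyed by (model, scaffold); the numbers and names are hypothetical and are not results from the paper.

```python
from statistics import mean

# Hypothetical accuracy scores keyed by (model, scaffold); not from the paper.
scores = {
    ("model_a", "react"): 0.40,
    ("model_a", "planner_executor"): 0.50,
    ("model_b", "react"): 0.45,
    ("model_b", "planner_executor"): 0.60,
}


def marginal_means(scores, axis):
    """Average score for each level of one factor (axis 0 = model, 1 = scaffold)."""
    groups = {}
    for key, value in scores.items():
        groups.setdefault(key[axis], []).append(value)
    return {level: mean(values) for level, values in groups.items()}


def spread(means):
    """Range between the best and worst level of a factor."""
    return max(means.values()) - min(means.values())


model_spread = spread(marginal_means(scores, 0))
scaffold_spread = spread(marginal_means(scores, 1))
```

If `scaffold_spread` is comparable to or larger than `model_spread`, a leaderboard that varies both at once cannot attribute the difference to the model alone.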

04 [RAG] When Should Queries Be Decomposed? A Stage-Aware Study of Query Decomposition for Multi-Condition Retrieval
Query decomposition hurts at the retrieval stage and helps at the reranking stage, which means applying it uniformly across a pipeline produces net-negative results. The mechanism at retrieval is semantic dilution: splitting a multi-condition query into sub-queries loses the joint constraint signal that vector similarity needs to find the right documents. At reranking, those same sub-queries enable fine-grained constraint verification that a single composite query cannot. Teams using decomposition as a blanket strategy should gate it to the reranking step only.
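
The gating described above can be sketched as a two-stage pipeline: retrieve with the full composite query, then rerank with the sub-queries. This is a toy illustration under stated assumptions, not the paper's system: token overlap stands in for vector similarity, the decomposition is written by hand, and all names and documents are hypothetical.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())


def retrieve(query, docs, k):
    """Stage 1: score each document against the full composite query.

    Token overlap is a stand-in for vector similarity here.
    """
    q = tokens(query)
    ranked = sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]


def rerank(sub_queries, candidates):
    """Stage 2: order candidates by how many sub-query constraints they satisfy."""
    def satisfied(doc):
        d = tokens(doc)
        # A constraint counts as met when all of its tokens appear in the document.
        return sum(tokens(sq) <= d for sq in sub_queries)
    return sorted(candidates, key=satisfied, reverse=True)


docs = [
    "waterproof hiking boots for wide feet",
    "waterproof hiking boots narrow fit",
    "wide feet running shoes",
    "waterproof phone case",
]
query = "waterproof hiking boots wide feet"
sub_queries = ["waterproof", "hiking boots", "wide feet"]  # hypothetical decomposition

candidates = retrieve(query, docs, k=3)   # composite query only
ranked = rerank(sub_queries, candidates)  # decomposition only at this stage
```

The design choice is that the sub-queries never touch the first stage, so the joint constraint signal stays intact during retrieval and the decomposition is used only to verify constraints on an already narrowed candidate set.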

05 [Training] Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?
Standard fine-tuning of pre-trained large time series models can perform worse than training from scratch, because the pre-trained loss landscape is poorly conditioned and non-convex in ways that standard optimizers cannot handle. The paper maps which conditioning interventions actually recover trainability rather than just masking overfitting, giving practitioners a concrete diagnostic before committing to a fine-tuning strategy. Anyone adapting foundation forecasting models to domain-specific data should check landscape conditioning before assuming pre-training provides a useful starting point.
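
As a rough picture of what "checking landscape conditioning" can mean, the sketch below estimates the Hessian of a two-parameter loss by finite differences and reports its eigenvalues and condition number. This is a generic textbook diagnostic, not the paper's method, and `toy_loss` is a hypothetical stand-in for a real model loss.

```python
import math


def hessian_2d(loss, w, h=1e-3):
    """Central finite-difference Hessian of a two-parameter loss at point w."""
    x, y = w
    f0 = loss(x, y)
    fxx = (loss(x + h, y) - 2 * f0 + loss(x - h, y)) / h**2
    fyy = (loss(x, y + h) - 2 * f0 + loss(x, y - h)) / h**2
    fxy = (loss(x + h, y + h) - loss(x + h, y - h)
           - loss(x - h, y + h) + loss(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy


def eigenvalues(fxx, fxy, fyy):
    """Eigenvalues (smaller, larger) of the symmetric matrix [[fxx, fxy], [fxy, fyy]]."""
    centre = (fxx + fyy) / 2
    radius = math.hypot((fxx - fyy) / 2, fxy)
    return centre - radius, centre + radius


def condition_number(fxx, fxy, fyy):
    """Ratio of the largest to the smallest absolute eigenvalue."""
    lo, hi = sorted(abs(e) for e in eigenvalues(fxx, fxy, fyy))
    return math.inf if lo == 0 else hi / lo


def toy_loss(x, y):
    # Hypothetical loss: steep along x, nearly flat along y.
    return 0.5 * (100 * x**2 + y**2)


hess = hessian_2d(toy_loss, (1.0, 1.0))
kappa = condition_number(*hess)
```

A large `kappa` means a single learning rate is too big for the steep direction or too small for the flat one, and a negative smaller eigenvalue from `eigenvalues` would indicate local non-convexity at that point. Real models need scalable estimators rather than a full finite-difference Hessian.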

06 [Inference] sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
Fixed rollout budgets in RLVR training waste compute at both ends of the difficulty range: easy queries produce near-zero advantage because the policy already solves them, and unsolvable queries produce no gradient signal at all. sGPO sorts queries by estimated difficulty and allocates rollout budget selectively, spending a small number of inference FLOPs to identify which queries sit in the trainable middle zone. The result is measurable training efficiency gains without accuracy loss. Teams running RLVR at scale should treat rollout budget allocation as a first-class optimization target, not a fixed hyperparameter.
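
The allocation idea can be illustrated with a few cheap probe rollouts per query. The sketch below is a simplification and not the sGPO algorithm: it filters on a probe pass rate instead of sorting by estimated difficulty, splits the budget evenly, and every name and number in it is hypothetical. The premise is that under a group-mean baseline, a query whose rollouts all receive the same reward contributes zero advantage.

```python
def allocate_rollouts(probe_results, total_budget):
    """Split a rollout budget across queries with a probe pass rate strictly between 0 and 1.

    probe_results maps query id -> list of 0/1 rewards from a few probe rollouts.
    Queries that always pass or always fail get no further rollouts.
    """
    pass_rates = {q: sum(r) / len(r) for q, r in probe_results.items()}
    trainable = [q for q, p in pass_rates.items() if 0.0 < p < 1.0]
    if not trainable:
        return {q: 0 for q in probe_results}
    per_query = total_budget // len(trainable)
    return {q: (per_query if q in trainable else 0) for q in probe_results}


# Hypothetical probe outcomes: four rollouts per query.
probe_results = {
    "easy": [1, 1, 1, 1],
    "medium_a": [1, 0, 0, 1],
    "medium_b": [0, 0, 1, 0],
    "unsolvable": [0, 0, 0, 0],
}
budget = allocate_rollouts(probe_results, total_budget=32)
```

Here the whole budget of 32 rollouts goes to the two mixed-outcome queries, 16 each. With so few probes, a hard but solvable query can be misread as unsolvable, so in practice the probe count and the cutoff would need tuning.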