Also Worth Noting - 2026-05-18

Five papers exposing hidden failure modes in training, evaluation, and deployment of LLM reasoning systems


02 [Training] Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
Selecting complete reasoning traces post-hoc wastes compute and misses complementary signal across teacher models. CoRD fixes this by having heterogeneous large reasoning models hand off mid-chain, with predictive perplexity-based scoring and beam search determining which teacher continues at each step. The result is more diverse, less redundant training data for small reasoning models without the brute-force sampling overhead. Teams distilling Long-CoT reasoning into smaller models should treat step-wise multi-teacher decoding as a direct replacement for trace-selection pipelines.
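
The mid-chain hand-off can be illustrated with toy teachers. This is a minimal sketch under stated assumptions, not CoRD's implementation: the teacher functions, their log-probabilities, the `step_perplexity` rule, and `collaborative_decode` are all hypothetical stand-ins, and the paper's actual scoring may differ.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical teacher interface: partial chain -> (next step, token log-probs).
Teacher = Callable[[List[str]], Tuple[str, List[float]]]

def step_perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of one proposed step: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def collaborative_decode(teachers: Dict[str, Teacher], max_steps: int,
                         beam_width: int = 2):
    """Beam search over steps; every teacher proposes a continuation per beam."""
    # Beam entry: (summed log-perplexity, chain of steps, teacher used per step)
    beams = [(0.0, [], [])]
    for _ in range(max_steps):
        candidates = []
        for score, chain, used in beams:
            for name, teacher in teachers.items():
                step, logprobs = teacher(chain)
                new_score = score + math.log(step_perplexity(logprobs))
                candidates.append((new_score, chain + [step], used + [name]))
        # Lower summed log-perplexity is better; keep the best partial chains.
        candidates.sort(key=lambda c: c[0])
        beams = candidates[:beam_width]
    return beams[0]

# Toy teachers (assumed): A is confident on the first step, B on later steps.
def teacher_a(chain):
    return f"A{len(chain)}", ([-0.1, -0.1] if not chain else [-2.0, -2.0])

def teacher_b(chain):
    return f"B{len(chain)}", ([-2.0, -2.0] if not chain else [-0.1, -0.1])

score, chain, used = collaborative_decode(
    {"a": teacher_a, "b": teacher_b}, max_steps=3
)
print(used)  # ['a', 'b', 'b']: teacher A opens, then hands off to teacher B
```

The selected chain mixes steps from both teachers, which a pipeline that picks one complete trace per teacher could not produce.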

03 [Eval] CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
A model can score perfectly on Doc-VQA answer accuracy while grounding every answer in the wrong passage, and current benchmarks will never catch it. CiteVQA requires models to return element-level bounding-box citations alongside answers, scoring evidence attribution separately from answer correctness. The gap between the two scores on current multimodal LLMs is large. For teams deploying document intelligence in law, finance, or medicine, answer accuracy alone is not a safety signal.
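
Scoring attribution separately from correctness can be sketched as follows. This is an illustrative sketch, not CiteVQA's metric: the exact-match answer check, the IoU threshold of 0.5, and the names `iou` and `score_example` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score_example(pred_answer: str, gold_answer: str,
                  pred_boxes: List[Box], gold_boxes: List[Box],
                  iou_threshold: float = 0.5) -> Dict[str, object]:
    """Return answer correctness and citation precision as separate scores."""
    answer_correct = pred_answer.strip().lower() == gold_answer.strip().lower()
    # A cited box is a hit if it overlaps any gold evidence box enough.
    hits = sum(
        any(iou(p, g) >= iou_threshold for g in gold_boxes) for p in pred_boxes
    )
    precision = hits / len(pred_boxes) if pred_boxes else 0.0
    return {"answer_correct": answer_correct, "citation_precision": precision}

# Right answer, wrong passage: the cited box is nowhere near the gold evidence.
result = score_example(
    pred_answer="42 days",
    gold_answer="42 days",
    pred_boxes=[(0.0, 0.0, 10.0, 10.0)],
    gold_boxes=[(100.0, 100.0, 110.0, 110.0)],
)
print(result)  # {'answer_correct': True, 'citation_precision': 0.0}
```

An answer-only benchmark would mark this example as fully correct; reporting the two fields separately exposes the grounding failure.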

04 [Theory] Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
RLVR training hits a ceiling not because the reward signal is weak but because the policy only ever samples trajectories it already knows. NudgeRL addresses this by steering rollouts toward underexplored strategy regions during sampling, producing structured diversity without multiplying rollout count. The framework improves reasoning gains per compute dollar compared to simply scaling the number of rollouts. Teams running RLVR fine-tuning that have hit a performance plateau should look at exploration structure before adding more compute.
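
One way to picture "structured diversity at a fixed rollout count" is a count-based allocation over strategy labels. The sketch below is an assumption-heavy illustration, not NudgeRL's algorithm: the strategy labels, the visit counts, and `plan_rollouts` are hypothetical, and the blurb does not describe how the paper defines or detects strategies.

```python
from typing import Dict, List

def plan_rollouts(strategy_counts: Dict[str, int], budget: int) -> List[str]:
    """Assign each rollout in a fixed budget to the least-visited strategy."""
    counts = dict(strategy_counts)  # copy so the caller's counts are untouched
    plan = []
    for _ in range(budget):
        # Least-visited first; ties broken by name to keep the plan deterministic.
        strategy = min(counts, key=lambda name: (counts[name], name))
        plan.append(strategy)
        counts[strategy] += 1
    return plan

# Hypothetical history: the policy has mostly sampled algebraic solutions.
history = {"algebraic": 5, "case_split": 1, "induction": 0}
plan = plan_rollouts(history, budget=4)
print(plan)  # ['induction', 'case_split', 'induction', 'case_split']
```

Each planned label would then steer one rollout, for example as a hint in the prompt. The budget stays at four rollouts, but none of them is spent on the strategy the policy already favors.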

05 [Agent] Auditing Agent Harness Safety
An agent harness can return a correct, benign final answer while leaking context to the wrong sub-agent or accessing unauthorized tools mid-trajectory, and output-level safety benchmarks are blind to both failures. The paper shows that most violations occur mid-trajectory rather than at termination, a region current evaluation frameworks do not score. Auditing harness safety requires tracing permission boundaries and information-flow constraints across the full execution path, not just the terminal state. Teams shipping multi-agent pipelines in production need trajectory-level audit coverage, not just output checks.
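
A trajectory-level check can be sketched as a pass over every step rather than only the last one. This is a minimal sketch, assuming a hypothetical trace format: the `Step` record, the permission tables, and `audit_trajectory` are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass
class Step:
    agent: str        # which sub-agent acted
    tool: str         # tool it invoked
    reads: Set[str]   # context labels it consumed

def audit_trajectory(trajectory: List[Step],
                     allowed_tools: Dict[str, Set[str]],
                     allowed_context: Dict[str, Set[str]]
                     ) -> List[Tuple[int, str, str]]:
    """Return (step index, agent, violation) for every step, not just the last."""
    violations = []
    for i, step in enumerate(trajectory):
        if step.tool not in allowed_tools.get(step.agent, set()):
            violations.append((i, step.agent, f"unauthorized tool: {step.tool}"))
        leaked = step.reads - allowed_context.get(step.agent, set())
        for label in sorted(leaked):
            violations.append((i, step.agent, f"context leak: {label}"))
    return violations

# Hypothetical trace: the final step is benign, the middle step is not.
trajectory = [
    Step("planner", "search", {"task"}),
    Step("researcher", "shell", {"task", "user_pii"}),
    Step("planner", "respond", {"task"}),
]
allowed_tools = {"planner": {"search", "respond"}, "researcher": {"search"}}
allowed_context = {"planner": {"task", "user_pii"}, "researcher": {"task"}}

violations = audit_trajectory(trajectory, allowed_tools, allowed_context)
for v in violations:
    print(v)
```

Both violations land on step index 1, so a check that inspects only the terminal step would report this run as clean.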

06 [Training] Hölder Policy Optimisation
GRPO's fixed aggregation of token-level probabilities is a structural constraint on policy update quality that persists regardless of how strong the reward signal is. HölderPO replaces it with a parameterized Hölder-mean aggregation that adapts per sequence, breaking the binary trade-off between training collapse and underperformance that fixed aggregations produce. The approach yields measurable gains on math reasoning benchmarks without changing the reward setup. Teams using GRPO for reasoning model training should treat the aggregation function as a tunable component, not a fixed architectural given.
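
For readers unfamiliar with the Hölder (power) mean, the sketch below shows how a single exponent `p` moves the aggregate of the same token-level values between familiar means: `p = 1` is the arithmetic mean, the `p → 0` limit is the geometric mean, and `p = -1` is the harmonic mean. The token values are made up, and how HölderPO chooses `p` per sequence is not described in this summary, so `p` is a plain argument here.

```python
import math
from typing import Sequence

def holder_mean(values: Sequence[float], p: float) -> float:
    """Power (Hölder) mean of positive values; p == 0 uses the geometric limit."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)

# Illustrative token-level probability ratios for one sequence (assumed values).
ratios = [0.5, 1.0, 2.0]

arithmetic = holder_mean(ratios, 1.0)
geometric = holder_mean(ratios, 0.0)
harmonic = holder_mean(ratios, -1.0)

# The power mean is non-decreasing in p, so lower p weights small tokens more.
print(harmonic, geometric, arithmetic)
```

Because the aggregate is monotone in `p`, the exponent acts as a dial between emphasizing the lowest-valued tokens and the highest-valued ones, which is what makes the aggregation tunable rather than fixed.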