Also Worth Noting - 2026-05-29

Belief drift, detector routing, unified retrieval, spatial shortcuts, and rubric reward modeling: five fixes for real production gaps


02 [Agent] When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Belief drift in long conversations is a quantifiable failure mode, not a vague alignment concern. BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, measures exactly when a model should update its state versus hold it, using symbolic verifiers for turn-level scoring. Current LLMs fail in two distinct directions: over-updating on irrelevant noise and under-updating on genuine evidence. Teams building multi-turn agents now have a concrete diagnostic instead of a gut-check. link
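The two failure directions can be told apart with a simple turn-level tally. The sketch below is illustrative only: `Turn`, `classify_turn`, and `score` are hypothetical names, not BeliefTrack's API, and it assumes a verifier has already supplied the correct belief after each turn.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    before: str    # model's belief entering the turn
    after: str     # model's belief after seeing the turn's input
    expected: str  # correct belief after the turn (assumed to come from a verifier)


def classify_turn(turn: Turn) -> str:
    """Label one turn as correct, over_update, under_update, or wrong_update."""
    if turn.after == turn.expected:
        return "correct"
    should_change = turn.expected != turn.before
    did_change = turn.after != turn.before
    if did_change and not should_change:
        return "over_update"   # belief moved although the evidence did not warrant it
    if not did_change and should_change:
        return "under_update"  # belief held although the evidence warranted a change
    return "wrong_update"      # belief changed, but to the wrong value


def score(turns):
    """Count how many turns fall into each category."""
    return Counter(classify_turn(t) for t in turns)
```

Reporting over-updates and under-updates separately, rather than one accuracy number, keeps the two failure directions visible.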

03 [Inference] Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense
Committing every request to a single prompt-injection detector means inheriting that detector's blind spots on every request. SCOUT reframes defense as a per-request allocation problem: it predicts each detector's reliability and latency on the incoming sample, then routes accordingly and escalates to an LLM judge only when uncertainty warrants it. The result is a heterogeneous detector pool that covers more attack surface without running all detectors on all traffic. For teams running inference pipelines at scale, this cuts both risk and cost simultaneously. link
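The per-request allocation idea can be sketched in a few lines. This is a minimal illustration, not SCOUT's actual policy: the `DetectorEstimate` type, the `route` function, the latency budget, and the reliability threshold are all assumptions, and the per-detector predictions are taken as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorEstimate:
    name: str
    reliability: float  # predicted probability the detector is right on this request
    latency_ms: float   # predicted latency on this request


def route(estimates, latency_budget_ms=50.0, min_reliability=0.9):
    """Pick the most reliable detector within budget, else escalate.

    The budget and threshold defaults are illustrative values only.
    """
    affordable = [e for e in estimates if e.latency_ms <= latency_budget_ms]
    if not affordable:
        return "llm_judge"  # nothing fits the budget: escalate
    # Prefer higher predicted reliability; break ties with lower latency.
    best = max(affordable, key=lambda e: (e.reliability, -e.latency_ms))
    if best.reliability < min_reliability:
        return "llm_judge"  # even the best candidate is too uncertain
    return best.name
```

With a cheap regex filter, a mid-cost classifier, and a slow large model in the pool, most requests would land on whichever detector is predicted to be trustworthy for that sample, and only the uncertain remainder would reach the judge.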

04 [RAG] OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Most enterprise RAG stacks maintain a separate retriever for each data type (unstructured text, relational tables, knowledge graphs, property graphs) because collapsing them into a shared embedding space erases the structural affordances that make each source useful. OmniRetrieval keeps those structural affordances intact under a single interface, routing queries without forcing a lowest-common-denominator representation. A single model matches or beats specialized retrievers across all four source types. That directly cuts the infrastructure overhead of maintaining parallel retrieval pipelines. link

05 [Eval] Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
VLMs that score well on spatial reasoning benchmarks are largely exploiting a statistical shortcut: vertical image position correlates with distance in natural photography, and models learn that correlation rather than genuine 3D structure. A representation-level analysis using minimal contrastive embedding pairs reveals a consistent vertical-distance entanglement across multiple model families. Spatial reasoning scores collapse when those co-occurrence cues are removed. Current spatial benchmarks measure dataset bias more than 3D competence, which means reported gains in this area deserve skepticism. link
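One way to check an in-house spatial eval for this shortcut is to split it by whether the vertical cue agrees with the ground truth. The sketch below is a behavioral check, not the paper's representation-level analysis; `PairExample` and `accuracy_by_cue` are hypothetical names, and it assumes each example compares two objects with known image positions and depths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairExample:
    a_is_higher: bool          # object A sits higher in the image than object B
    a_is_farther: bool         # ground truth: A is farther from the camera than B
    predicted_a_farther: bool  # the model's answer


def accuracy_by_cue(examples):
    """Accuracy on cue-aligned vs. cue-conflict examples (None if a subset is empty)."""
    buckets = {"aligned": [], "conflict": []}
    for ex in examples:
        # "aligned" means the higher-is-farther heuristic gives the right answer.
        key = "aligned" if ex.a_is_higher == ex.a_is_farther else "conflict"
        buckets[key].append(ex.predicted_a_farther == ex.a_is_farther)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A large gap between the two subsets suggests the model is leaning on vertical position rather than depth.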

06 [Training] RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Rubric-based reward modeling for subjective tasks breaks down because Boolean aggregation of criteria produces too many ties, making the reward signal useless for RL. RUBRIC-ARROW fixes this by jointly training a rubric generator and a rubric-conditioned judge in alternating stages, using a probability-based scoring rule that converts hard Boolean checks into continuous scores. The RL stage uses only pairwise preference data, removing the dependency on frontier LLM judges entirely. Teams post-training on creative or open-ended tasks now have a path to reliable reward modeling without GPT-4-class infrastructure. link
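The tie problem is easy to see on a toy case. The sketch below assumes a judge that emits, per rubric criterion, a probability that the criterion is satisfied; the mean of those probabilities stands in for the paper's scoring rule, which is not specified here, and the function names and numbers are illustrative.

```python
def boolean_score(probs, threshold=0.5):
    """Count criteria judged satisfied after hard thresholding."""
    return sum(p >= threshold for p in probs)


def soft_score(probs):
    """Mean per-criterion probability (an assumed stand-in for a probability-based rule)."""
    return sum(probs) / len(probs)


# Made-up judge probabilities for two responses on the same three criteria.
response_a = [0.9, 0.8, 0.6]
response_b = [0.6, 0.55, 0.51]

# Hard checks pass every criterion for both responses, so the rewards tie.
tied = boolean_score(response_a) == boolean_score(response_b)

# The continuous score still separates the clearly better response.
a_preferred = soft_score(response_a) > soft_score(response_b)
```

Here both responses pass all three Boolean checks, so an RL step would see no difference between them, while the continuous score still ranks the first response higher.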