Also Worth Noting - 2026-05-26

Five papers reshuffling assumptions across retrieval, inference scaling, CoT auditing, edge MoE, and long-context memory

02 [RAG] Your Embedding Model is SMARTer Than You Think
Standard single-vector embedding models already encode fine-grained, token-level local evidence, so ColBERT-style multi-vector pipelines may not be the only way to get it. SMART extracts that latent multi-vector signal from an already-trained contrastive model without retraining, recovering what standard compression was discarding all along. The practical upshot: teams running expensive multi-vector retrieval infrastructure should test whether their existing embedding model closes the gap before committing to a heavier pipeline.
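For readers weighing the two retrieval styles, here is a minimal sketch of how single-vector scoring differs from ColBERT-style late interaction (MaxSim). It does not reproduce SMART's extraction step, which the blurb does not spell out; the function names, the mean pooling, and the toy 2-d vectors are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_pool(vectors):
    """Collapse token vectors into one vector by averaging each dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_vector_score(query_tokens, doc_tokens):
    """One pooled vector per side, compared once."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: each query token keeps its best-matching doc token."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)

# Toy token vectors, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both query tokens exactly
doc_b = [[1.0, 1.0], [1.0, 1.0]]  # same pooled direction, no exact token match

print(single_vector_score(query, doc_a), single_vector_score(query, doc_b))
print(maxsim_score(query, doc_a), maxsim_score(query, doc_b))
```

In this toy case the pooled scores tie while MaxSim prefers doc_a, which is the kind of token-level distinction a multi-vector pipeline is meant to preserve. Whether a given trained embedding model already carries that signal is what the paper argues and what teams would need to test on their own data.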

03 [Inference] Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
Parallel test-time scaling bleeds compute because branches never share what they discover mid-search, causing the same findings to be rediscovered repeatedly across isolated threads. Cross-branch communication fixes this: intermediate reasoning is shared in real time so branches build on each other instead of duplicating work. At the same compute budget, answer quality improves and total search steps drop. Teams using parallel sampling for hard reasoning tasks should treat branch isolation as a tunable cost, not a fixed architectural constraint.
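The duplicated-work argument can be made concrete with a toy counter. This is only a sketch under simplifying assumptions: each branch has a fixed plan of findings, deriving a finding costs one step, and a shared pool makes a finding free for every other branch. The function name and the plans are hypothetical and are not the paper's communication mechanism.

```python
def count_search_steps(branch_plans, share):
    """Count steps spent deriving findings across parallel branches.

    A step is spent only when a branch derives a finding it does not
    already know. With share=True all branches read and write one
    common pool; with share=False each branch keeps a private pool.
    """
    shared_pool = set()
    private_pools = [set() for _ in branch_plans]
    steps = 0
    longest = max(len(plan) for plan in branch_plans)
    for i in range(longest):  # interleave branches one step at a time
        for b, plan in enumerate(branch_plans):
            if i >= len(plan):
                continue
            known = shared_pool if share else private_pools[b]
            if plan[i] not in known:
                known.add(plan[i])
                steps += 1
    return steps

# Hypothetical branch plans with overlapping intermediate findings.
plans = [
    ["lemma_a", "lemma_b", "case_1"],
    ["lemma_a", "lemma_c", "case_2"],
    ["lemma_b", "lemma_a", "case_3"],
]

print(count_search_steps(plans, share=False))  # 9
print(count_search_steps(plans, share=True))   # 6
```

The gap between the two counts is the rediscovery cost of isolation in this toy setup. Real systems also pay a cost for communication itself, which the sketch ignores.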

04 [Eval] Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
Every popular chain-of-thought faithfulness metric tested fails to correlate with ground-truth labels of whether a reasoning trace reflects actual model computation. The core problem is that prior metric proposals compared scores against each other rather than against a verifiable ground truth, so false assurance compounded across the literature. Teams using CoT traces for auditing or compliance should treat current faithfulness scores as unreliable signals until metrics are validated against causal ground truth.
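The failure mode described above, metrics that agree with each other but not with ground truth, is easy to see in a small correlation check. The sketch below uses a hand-rolled Pearson correlation; the two metric score lists and the binary labels are placeholder values invented for illustration, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Placeholder data: 1 = trace is faithful, 0 = trace is not (hypothetical labels).
ground_truth = [1, 1, 1, 0, 0, 0]
metric_b = [0.50, 0.20, 0.80, 0.50, 0.20, 0.80]  # hypothetical metric scores
metric_c = [0.55, 0.25, 0.85, 0.45, 0.15, 0.75]  # a second hypothetical metric

print("metric_b vs metric_c:", round(pearson(metric_b, metric_c), 2))
print("metric_b vs ground truth:", round(pearson(metric_b, ground_truth), 2))
print("metric_c vs ground truth:", round(pearson(metric_c, ground_truth), 2))
```

With these placeholder numbers the two metrics track each other closely while neither tracks the labels. Validating a new metric only against existing metrics would miss that entirely.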

05 [Hardware] MobileMoE: Scaling On-Device Mixture of Experts
MoE has been treated as a hundred-billion-parameter trick; MobileMoE shows the approach also sets a new Pareto frontier at 0.3-0.9B active parameters and 1.3-5.3B total parameters, operating within mobile memory and compute constraints. An on-device MoE scaling law jointly optimizes sparsity and architecture, identifying a sweet spot where moderate sparsity yields more capability per active FLOP than dense alternatives. Teams targeting edge deployment can get meaningfully more model capacity without increasing active compute by adopting sparse expert routing at sub-billion scale.
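The active-versus-total distinction comes down to simple arithmetic for a top-k routed MoE. The sketch below is a simplification that lumps all layers together; the function name and the example configuration are hypothetical and are not a MobileMoE configuration from the paper.

```python
def moe_param_counts(shared_params, num_experts, params_per_expert, top_k):
    """Return (active, total) parameter counts for a top-k routed MoE.

    shared_params: parameters every token uses (attention, embeddings, router).
    params_per_expert: parameters in one expert, summed over all MoE layers.
    """
    if not 1 <= top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return active, total

# Hypothetical configuration: 0.2B shared, 16 experts of 0.1B each, 2 routed per token.
active, total = moe_param_counts(
    shared_params=200_000_000,
    num_experts=16,
    params_per_expert=100_000_000,
    top_k=2,
)
print(active)  # 400000000
print(total)   # 1800000000
```

Per-token compute scales with the active count, while memory footprint scales with the total, which is why the on-device setting has to optimize both at once.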

06 [Training] Language Models Need Sleep
Instead of extending context windows, this approach periodically compresses recent context into fast weights via a learned local update rule, then clears the KV cache entirely. During each sleep phase, the model runs N offline recurrent passes over accumulated context and encodes it into state-space model blocks, shifting extra computation off the live inference path. The approach trades exact recall for scalable long-horizon operation, with no architectural overhaul required. Worth watching for teams hitting KV cache memory walls on long-running agent tasks.
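The wake/sleep cycle can be pictured with a toy memory object. This is a loose sketch, not the paper's method: a Python list stands in for the KV cache, a single float stands in for the fast weights, and a fixed moving-average update stands in for the learned local update rule. The class name and every parameter are hypothetical.

```python
class SleepyMemory:
    """Toy wake/sleep loop: exact but growing cache, fixed-size lossy state."""

    def __init__(self, cache_limit=4, sleep_passes=2, rate=0.5):
        self.cache = []            # stand-in for the KV cache
        self.state = 0.0           # stand-in for fast weights
        self.cache_limit = cache_limit
        self.sleep_passes = sleep_passes  # the "N offline passes"
        self.rate = rate
        self.sleeps = 0

    def observe(self, value):
        """Live path: append to the cache, sleep once the cache is full."""
        self.cache.append(value)
        if len(self.cache) >= self.cache_limit:
            self.sleep()

    def sleep(self):
        """Offline path: fold the cache into the state, then clear the cache."""
        for _ in range(self.sleep_passes):
            for value in self.cache:
                # Moving-average update, an assumed placeholder for a learned rule.
                self.state += self.rate * (value - self.state)
        self.cache.clear()
        self.sleeps += 1

memory = SleepyMemory()
for step in range(10):
    memory.observe(float(step))

print(len(memory.cache), memory.sleeps)
```

After ten observations the cache holds only the most recent unconsolidated items, and everything older survives only as a lossy summary in the state. That is the exact-recall-for-bounded-memory trade the blurb describes, in miniature.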