Also Worth Noting - 2026-05-24

Five papers tightening the screws on LLM training signals, faithfulness audits, inference cost, and caching accuracy

02 [Training] NITP: Next Implicit Token Prediction for LLM Pre-training
Standard next-token prediction leaves hidden states under-constrained, pushing representations into anisotropic configurations that hurt generalization. NITP adds a dense, continuous supervision signal in latent space during pre-training: the model learns to predict the implicit semantic content of the next token alongside the discrete label. The architecture is unchanged at inference time, so the method adds no deployment cost. Teams pre-training or continually training LLMs on custom corpora get a low-overhead path to better-structured representations. link
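To make the "discrete label plus latent target" idea concrete, here is a minimal sketch of a combined loss. It is not the paper's objective: the cosine-distance term, the `weight` parameter, and the choice of `target_latent` are all assumptions made for illustration, since the summary above does not specify them.

```python
import math

def cross_entropy(logits, target_id):
    """Standard next-token loss: negative log-softmax of the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def combined_loss(logits, target_id, predicted_latent, target_latent, weight=0.5):
    """Discrete loss plus a continuous latent-space term.

    `predicted_latent` stands in for the model's latent prediction and
    `target_latent` for some representation of the next token. Both, and
    `weight`, are hypothetical placeholders, not the paper's definitions.
    """
    discrete = cross_entropy(logits, target_id)
    continuous = cosine_distance(predicted_latent, target_latent)
    return discrete + weight * continuous
```

In this sketch the latent term only appears in the training loss, which is consistent with the point that nothing extra is needed at inference time.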

03 [Eval] Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
Teams that audit model behavior with chain-of-thought (CoT) faithfulness metrics are likely measuring the wrong thing. This meta-evaluation constructs ground-truth faithfulness labels and shows that current metrics correlate poorly with them, exposing a structural gap between what the scores report and what actually happens inside the model. The problem is not one of calibration: the metrics were never validated against observable ground truth. Any interpretability or compliance workflow that relies on CoT faithfulness scores should treat those scores as unvalidated until a replacement metric clears this bar. link
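The basic check described here, comparing a metric's scores against ground-truth labels, can be sketched with a plain Pearson correlation. The summary does not say which statistic the paper uses, so the choice of Pearson and the toy numbers below are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    With binary ground-truth labels in `ys` this is the point-biserial
    correlation. Assumes neither sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: metric scores per example, and 1/0 labels saying
# whether the reasoning trace was actually faithful.
metric_scores = [0.1, 0.2, 0.8, 0.9]
ground_truth = [0, 0, 1, 1]
agreement = pearson(metric_scores, ground_truth)
```

A metric worth trusting should show a high correlation on data like this; the paper's finding is that existing metrics do not.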

04 [Training] Hide to Guide: Learning via Semantic Masking
Reinforcement learning with verifiable rewards (RLVR) stalls on hard problems because the model never receives a reward signal to learn from, and feeding it raw expert traces leaks answer-relevant content that creates reward-hacking shortcuts. Semantic masking strips reward-relevant spans from expert traces before the model sees them, providing an exploration signal without teaching the model to copy the answer path. The approach targets the sparse-reward failure mode that makes RL fine-tuning unreliable on difficult reasoning tasks. Teams running RLVR pipelines on math or code should consider it as a practical intervention before scaling compute. link
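The masking step itself is simple to picture. The sketch below assumes the reward-relevant spans have already been identified (how the paper finds them is not covered in this summary); `mask_spans` and the `[MASKED]` placeholder are hypothetical names.

```python
def mask_spans(tokens, spans, mask_token="[MASKED]"):
    """Replace each half-open [start, end) span with a single mask token.

    `spans` lists the token ranges judged reward-relevant. They must not
    overlap; identifying them is outside the scope of this sketch.
    """
    masked = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        masked.extend(tokens[cursor:start])  # keep text before the span
        masked.append(mask_token)            # hide the span itself
        cursor = end
    masked.extend(tokens[cursor:])           # keep the remaining tail
    return masked

trace = ["so", "x", "=", "4", "therefore", "answer", "4"]
hint = mask_spans(trace, [(3, 4), (6, 7)])
# hint == ["so", "x", "=", "[MASKED]", "therefore", "answer", "[MASKED]"]
```

The model still sees the shape of the expert's reasoning but has to produce the hidden values itself.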

05 [Inference] Locality Matters for Training-Free Audio Token Compression in Audio-Language Models
Dropping audio tokens by global importance score discards neighboring context that carries meaning, wasting the context budget rather than saving it. Locality-aware compression preserves contiguous token groups rather than isolated salient tokens, cutting audio prefix length by over 50 percent with minimal quality loss and no retraining. The finding reframes audio token pruning as a local-structure problem rather than a global-ranking problem. Any audio-language deployment bottlenecked by long prefix sequences can apply this training-free method directly. link
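The contrast between global ranking and local grouping can be shown in a few lines. This is a simplified illustration, not the paper's algorithm: fixed-size windows, mean-score ranking, and the function names are all assumptions.

```python
def keep_top_tokens(scores, keep_ratio):
    """Global baseline: keep the individually highest-scoring tokens."""
    budget = int(len(scores) * keep_ratio)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def keep_top_windows(scores, window, keep_ratio):
    """Locality-aware variant: keep whole contiguous windows of tokens.

    Windows are ranked by mean importance and added until the token
    budget is used up, so retained tokens keep their neighbours.
    """
    n = len(scores)
    budget = int(n * keep_ratio)
    windows = [(s, min(s + window, n)) for s in range(0, n, window)]
    ranked = sorted(
        windows,
        key=lambda w: sum(scores[w[0]:w[1]]) / (w[1] - w[0]),
        reverse=True,
    )
    kept, used = [], 0
    for start, end in ranked:
        if used + (end - start) > budget:
            continue  # this window would exceed the budget
        kept.extend(range(start, end))
        used += end - start
    return sorted(kept)

# Hypothetical per-token importance scores for eight audio tokens.
scores = [0.9, 0.1, 0.2, 0.8, 0.7, 0.6, 0.1, 0.95]
scattered = keep_top_tokens(scores, keep_ratio=0.5)              # [0, 3, 4, 7]
contiguous = keep_top_windows(scores, window=2, keep_ratio=0.5)  # [4, 5, 6, 7]
```

Both calls keep half the tokens, but the global baseline returns scattered positions while the windowed version returns one unbroken stretch.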

06 [RAG] MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation
Single-vector semantic caches miss near-duplicate prompts that differ only in phrasing, triggering redundant LLM calls that a better cache would have absorbed. MVR-cache replaces the single-vector lookup with multi-vector retrieval built on a learnable segmentation model that splits prompts into semantically meaningful chunks and runs fine-grained similarity comparisons via MaxSim. The training objective is derived from a theoretical analysis of retrieval accuracy rather than tuned empirically. Teams running semantic caching at scale get a drop-in upgrade that closes the phrasing-variation gap without changing the surrounding infrastructure. link
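For readers unfamiliar with MaxSim: each query segment vector is matched to its most similar cached segment vector, and those best-match scores are aggregated. The sketch below shows only that scoring step. The segment vectors are hard-coded stand-ins for the output of the learned segmentation and embedding models, and averaging (rather than summing) plus the `threshold` value are assumptions made so a fixed cutoff is easy to apply.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim(query_vecs, cached_vecs):
    """Mean over query segments of the best match among cached segments."""
    best_matches = [max(cosine(q, c) for c in cached_vecs) for q in query_vecs]
    return sum(best_matches) / len(best_matches)

def lookup(query_vecs, cache, threshold=0.9):
    """Return the key of the best-scoring cache entry, or None on a miss."""
    best_key, best_score = None, float("-inf")
    for key, cached_vecs in cache.items():
        score = maxsim(query_vecs, cached_vecs)
        if score > best_score:
            best_key, best_score = key, score
    return best_key if best_score >= threshold else None

# Toy cache: each entry holds one vector per prompt segment.
cache = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[1.0, 1.0]],
}
hit = lookup([[0.0, 1.0], [1.0, 0.0]], cache)  # same segments, reordered: "a"
miss = lookup([[-1.0, 0.0]], cache)            # nothing similar enough: None
```

Because each segment is matched independently, a prompt whose segments are reordered or partly rephrased can still score highly against the cached entry, which is the gap a single pooled vector tends to miss.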