Also Worth Noting - 2026-06-10

Five papers tightening inference pipelines and training loops, from VLM token routing to psychologically informed refusals


02 [Inference] Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Visual token importance shifts across decoder depth, so permanently discarding low-ranked tokens early can erase information that later layers actually need. Reroute keeps those tokens in a recoverable cache instead of deleting them, letting later layers pull them back when grounding-sensitive queries demand it. The fix is training-free and slots into existing token-reduction pipelines without meaningful latency overhead. Teams running VLMs with aggressive token pruning should treat irreversible removal as a fragility point, not a safe default.
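The park-and-recover idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `TokenRouter` class, its method names, the per-layer score dictionaries, and the threshold rule are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TokenRouter:
    """Hypothetical helper: parks pruned visual tokens instead of deleting them."""
    active: dict                                  # token id -> feature vector
    parked: dict = field(default_factory=dict)    # recoverable side cache

    def prune(self, scores, keep):
        """Keep the `keep` highest-scoring active tokens; park the rest."""
        ranked = sorted(self.active, key=lambda t: scores[t], reverse=True)
        for tok in ranked[keep:]:
            self.parked[tok] = self.active.pop(tok)

    def recover(self, scores, threshold):
        """Restore parked tokens whose new score reaches `threshold`."""
        restored = [t for t in self.parked if scores.get(t, 0.0) >= threshold]
        for tok in restored:
            self.active[tok] = self.parked.pop(tok)
        return sorted(restored)


# Early layer: tokens 1 and 3 look unimportant and are parked, not dropped.
router = TokenRouter(active={0: [1.0], 1: [2.0], 2: [3.0], 3: [4.0]})
router.prune({0: 0.9, 1: 0.1, 2: 0.5, 3: 0.2}, keep=2)

# Later layer: token 1 now scores high (assumed re-scoring), so it returns.
restored = router.recover({1: 0.8, 3: 0.1}, threshold=0.5)
```

The point of the design is that `prune` is reversible: nothing leaves memory until the model is done, so a token that only matters to a late layer can still be reached.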

03 [Inference] VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
Binary accept-or-recompute decisions in speculative decoding waste compute on tokens that sit in a middle zone: too uncertain for the drafter, but not uncertain enough to require the full verifier. VIA-SD routes those near-miss rejects to a slim submodel derived from the full verifier through intra-model routing, handling moderate-confidence cases without triggering an expensive full-model call. The result is a three-tier verification path that recovers a measurable fraction of rejected drafts at lower cost. Teams running speculative decoding in production should audit how many rejected tokens fall into that recoverable middle band.
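The suggested audit could start from something like the sketch below. It assumes a scalar confidence per drafted token and two cut-offs; the function names `route` and `audit` and the threshold values 0.9 and 0.5 are hypothetical placeholders, not the routing criterion used by VIA-SD.

```python
from collections import Counter


def route(confidence, accept_at=0.9, slim_at=0.5):
    """Assign a drafted token to one of three tiers (thresholds are assumed)."""
    if confidence >= accept_at:
        return "accept"   # draft token kept as-is
    if confidence >= slim_at:
        return "slim"     # near-miss: would go to the slim submodel
    return "full"         # clear miss: would go to the full verifier


def audit(confidences, accept_at=0.9, slim_at=0.5):
    """Count tokens per tier and the share of rejects in the middle band."""
    counts = Counter(route(c, accept_at, slim_at) for c in confidences)
    rejected = counts["slim"] + counts["full"]
    middle_share = counts["slim"] / rejected if rejected else 0.0
    return counts, middle_share


counts, middle_share = audit([0.95, 0.7, 0.6, 0.2, 0.92])
```

A high `middle_share` on logged production traffic would suggest that many full-verifier calls are being spent on tokens a cheaper check might have settled.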

04 [Training] Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
Multi-token prediction acceptance rates collapse during RL training not because the draft model is poorly designed, but because RL shrinks output entropy faster than the draft model expects. Bebop traces this to an entropy mismatch and fixes it with rejection sampling at the rollout stage, restoring acceptance rates and cutting wall-clock training time without altering the core RL objective. The paper also ships practical recipes for integrating MTP into large-scale RL pipelines. Teams using MTP to accelerate RL rollouts should check entropy trajectories before blaming draft model quality.
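The entropy-mismatch effect can be reproduced on a toy example. The sketch below uses the standard speculative-sampling rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q), which is an assumption here and may differ from Bebop's exact scheme. The distributions are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def expected_acceptance(target, draft):
    """Expected acceptance rate under accept-with-probability min(1, p/q)."""
    return sum(min(p, q) for p, q in zip(target, draft))


def residual(target, draft):
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    gaps = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(gaps)
    return [g / total for g in gaps] if total > 0 else list(target)


# Assumed toy distributions over four tokens; the draft head stays frozen
# while the RL policy sharpens.
draft = [0.4, 0.3, 0.2, 0.1]
early_policy = [0.4, 0.3, 0.2, 0.1]      # matches the draft
late_policy = [0.85, 0.05, 0.05, 0.05]   # lower entropy after RL

early_rate = expected_acceptance(early_policy, draft)
late_rate = expected_acceptance(late_policy, draft)
```

Comparing `entropy(late_policy)` with `entropy(early_policy)` alongside the two rates shows the pattern described above: the draft did not get worse, the target simply moved to a sharper distribution.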

05 [Theory] Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Standard MoE router rows are initialized without any constraint tying them to the expert matrices they proxy, so dot-product similarity between tokens and router rows reflects surface geometry rather than true token-expert affinity. Manifold power iteration aligns each router row to the principal singular direction of its associated expert matrix, giving the router a principled encoding of what each expert actually does. The alignment adds no parameters and improves routing quality across tested configurations. Teams training MoE models from scratch should consider this as a low-cost router initialization strategy.
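For intuition, here is textbook power iteration used to find the principal singular direction of a small matrix. It is not the paper's manifold variant. It assumes the expert matrix `w` maps inputs as `w @ x`, so the router row is taken to be the top right singular vector; the helper names and the 2x2 example are hypothetical.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def principal_input_direction(w, steps=50):
    """Power iteration on W^T W; approximates the top right singular vector.

    Assumes W is nonzero and the start vector is not orthogonal to the answer.
    """
    wt = transpose(w)
    v = normalize([1.0] * len(w[0]))
    for _ in range(steps):
        v = normalize(matvec(wt, matvec(w, v)))
    return v


def affinity(router_row, token):
    """Dot-product routing score between a router row and a token."""
    return sum(a * b for a, b in zip(router_row, token))


# Toy expert that amplifies the first input dimension three times more
# than the second.
expert = [[3.0, 0.0], [0.0, 1.0]]
router_row = principal_input_direction(expert)
```

With this initialization, a token aligned with the dimension the expert amplifies most receives the higher routing score, which is the kind of token-expert tie the summary describes.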

06 [Application] PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
A blunt refusal in a crisis or coercion scenario can block direct harm while still failing the person who sent the request, a distinction current RLHF pipelines do not capture. PsychoSafe reframes refusal as structured supportive communication grounded in evidence-based intervention strategies, building a curated corpus to train models that decline harmful requests while addressing the underlying need. Psychologically informed responses outperform blunt non-compliance in high-risk interactions across the benchmark. Safety teams designing refusal policies for consumer-facing deployments should treat the quality of the refusal, not just its presence, as an outcome to optimize.