Also Worth Noting - 2026-04-30
Semantic diversity at decode time, cross-architecture dLLM distillation, MCP credential leakage, lossless RL rollout speedups, and a clinician-grounded eval benchmark.
02 [Inference] Large Language Models Explore by Latent Distilling
Standard stochastic sampling produces outputs that vary lexically but are semantically redundant, and that redundancy is the real bottleneck for test-time compute scaling. ESamp targets it directly: a lightweight auxiliary model distills latent representations and steers decoding toward outputs that differ in meaning, not just in surface tokens. The method exploits a known property of neural networks, higher prediction error on novel inputs, to identify and amplify genuinely unfamiliar generation directions. Teams using best-of-N or majority-vote scaling should check whether their diversity budget is buying real semantic variation or mere token shuffling.
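The prediction-error idea can be illustrated with a toy sketch. This is not ESamp's implementation: the "teacher" below is a hypothetical stand-in for a large model's latent map, the "student" is a per-feature linear fit, and all names (`teacher_latent`, `fit_student`, `novelty`, `pick_most_novel`) are invented for illustration. It only shows why a distilled predictor tends to err more on inputs unlike those it was fit on.

```python
import math

# Hypothetical stand-in for a large model's latent map: one tanh unit per feature.
TEACHER_WEIGHTS = [1.0, 2.0, -1.5]


def teacher_latent(x):
    """Latent vector the student tries to predict (toy assumption)."""
    return [math.tanh(w * xi) for w, xi in zip(TEACHER_WEIGHTS, x)]


def fit_student(seen):
    """Closed-form least squares for a per-feature linear student a_k * x_k."""
    slopes = []
    for k in range(len(TEACHER_WEIGHTS)):
        num = sum(x[k] * teacher_latent(x)[k] for x in seen)
        den = sum(x[k] ** 2 for x in seen)
        slopes.append(num / den)
    return slopes


def novelty(slopes, x):
    """Squared error between the student's prediction and the teacher latent."""
    target = teacher_latent(x)
    return sum((a * xi - t) ** 2 for a, xi, t in zip(slopes, x, target))


def pick_most_novel(slopes, candidates):
    """Index of the candidate the student predicts worst."""
    scores = [novelty(slopes, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# "Seen" region: small inputs, where the teacher is close to linear.
seen = [[i / 10, -i / 10, i / 20] for i in range(-5, 6)]
student = fit_student(seen)

candidates = [
    [0.2, -0.1, 0.05],  # close to the seen region
    [0.3, 0.2, -0.1],   # also close to the seen region
    [3.0, -2.5, 2.0],   # far outside the seen region
]
best = pick_most_novel(student, candidates)
```

In this toy setup the third candidate gets the highest score because the linear student extrapolates where the teacher saturates. A real system would score latent states from the model during decoding, not hand-written feature vectors.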
03 [Agent] MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents
Credential leakage in multi-server MCP agents does not require a malicious prompt or adversarial model behavior; it emerges from workflow topology alone. MCPHunt uses canary-based taint tracking to reduce propagation detection to objective string matching, isolating verbatim credential flow across trust boundaries as a structural side effect of composing read and write tools. The benchmark is, to current knowledge, the first controlled setup that separates this structural risk from adversarial scenarios. Any multi-server MCP deployment with mixed read/write permissions should treat cross-boundary credential propagation as a default risk, not an edge case.
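A minimal sketch of canary-based detection, assuming a recorded trace of tool calls is available. This is not MCPHunt's API: the `ToolCall` record, the server labels, and the helper names are hypothetical. The point is only that once a unique canary is planted as a fake credential, detecting propagation reduces to a substring check on writes to other servers.

```python
import secrets
from dataclasses import dataclass


@dataclass
class ToolCall:
    server: str     # hypothetical label for the MCP server that handled the call
    direction: str  # "read" or "write"
    payload: str    # text returned by a read or sent in a write


def make_canary(label: str) -> str:
    """Build a unique marker that cannot plausibly appear by coincidence."""
    return f"CANARY-{label}-{secrets.token_hex(8)}"


def find_leaks(trace, canaries):
    """Return (canary, origin, sink) for each write that carries a canary
    verbatim to a server other than the one it was planted on."""
    leaks = []
    for call in trace:
        if call.direction != "write":
            continue
        for canary, origin in canaries.items():
            if call.server != origin and canary in call.payload:
                leaks.append((canary, origin, call.server))
    return leaks


# Plant one fake credential on a hypothetical "vault" server.
canary = make_canary("db-password")
canaries = {canary: "vault"}

# Hypothetical trace: the agent reads config, then files a ticket quoting it.
trace = [
    ToolCall("vault", "read", f"DB_PASSWORD={canary}"),
    ToolCall("tracker", "write", f"Deploy failed with config: DB_PASSWORD={canary}"),
    ToolCall("vault", "write", f"rotated: {canary}"),
    ToolCall("tracker", "write", "Deploy failed, see logs"),
]
leaks = find_leaks(trace, canaries)
```

Only the second call counts as a leak here: it is a write, it targets a different server than the canary's origin, and it contains the canary verbatim. No prompt in the trace is adversarial, which is the structural point the blurb makes.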
04 [Training] Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Autoregressive rollout generation is now the dominant wall-clock bottleneck in frontier RL post-training, and speculative decoding addresses it without touching the reward model, policy updates, or output distribution. The implementation runs inside NeMo-RL with a vLLM backend and supports both synchronous and asynchronous execution modes. Because speculative decoding is lossless by construction, the acceleration carries no fidelity tradeoff: the target model's output distribution is preserved exactly. Teams running RL post-training at scale can adopt this as an infrastructure-layer speedup with no changes to the optimization regime.
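The "lossless by construction" claim rests on the standard speculative sampling accept/reject rule. The sketch below shows that rule for a single token over a toy three-token vocabulary. It is not the NeMo-RL or vLLM integration; the distributions `p` (target) and `q` (draft) and both function names are made up for illustration. Exact fractions are used so the equality can be checked without floating-point noise.

```python
import random
from fractions import Fraction as F


def speculative_step(p, q, rng):
    """Sample one token: draft from q, then verify against the target p."""
    tokens = list(p)
    draft = rng.choices(tokens, weights=[float(q[t]) for t in tokens])[0]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, float(p[draft] / q[draft])):
        return draft
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [float(max(p[t] - q[t], F(0))) for t in tokens]
    return rng.choices(tokens, weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    tokens = list(p)
    accept = {t: min(p[t], q[t]) for t in tokens}  # q(t) * min(1, p(t)/q(t))
    reject_mass = 1 - sum(accept.values())
    residual = {t: max(p[t] - q[t], F(0)) for t in tokens}
    total = sum(residual.values())
    return {
        t: accept[t] + (reject_mass * residual[t] / total if total else 0)
        for t in tokens
    }


# Toy target and draft distributions that deliberately disagree.
p = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
q = {"a": F(1, 5), "b": F(1, 2), "c": F(3, 10)}

result = output_distribution(p, q)
sample = speculative_step(p, q, random.Random(0))
```

Here `result` equals `p` exactly even though the draft `q` puts most of its mass on a different token. A poor draft lowers the acceptance rate, and therefore the speedup, but does not change what is sampled.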
05 [Eval] HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats
Academic medical benchmarks test curated QA; HealthBench Professional tests what clinicians actually type into ChatGPT during a shift. The benchmark is built from real physician-authored conversations across three use cases (care consult, writing and documentation, and medical research), with rubrics written and iterated by clinicians rather than benchmark designers. That grounding captures the context-dependent, open-ended queries that existing benchmarks systematically exclude and that determine real deployment safety. Teams evaluating models for clinical deployment should treat performance on curated medical QA as a floor, not a proxy for production behavior.