Also Worth Noting - 2026-05-07

From single-decode hallucination detection to population-level diversity collapse, five papers that cut inference cost or expose hidden evaluation blind spots

02 [Eval] The First Token Knows: Single-Decode Confidence for Hallucination Detection
First-token entropy, computed from a single greedy decode, matches or beats semantic self-consistency for hallucination detection without any external NLI model. The signal is the normalized entropy over the top-K logits at the first content-bearing answer token, a calculation that adds no cost beyond the forward pass already running. That cuts the inference overhead of multi-sample consistency checks by 5-10x. Teams running hallucination filters in production pipelines can replace the sampling loop with a single logit read at that first answer token.
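
The entropy signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `normalized_topk_entropy`, the default `k=10`, and the choice to renormalize probabilities over the top-k logits only are all assumptions made for this example.

```python
import math


def normalized_topk_entropy(logits, k=10):
    """Entropy of the softmax over the top-k logits, scaled to [0, 1].

    `logits` is the raw logit vector at one token position (here, the first
    content-bearing answer token). The default k is an assumption.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits to compute an entropy")
    # Numerically stable softmax restricted to the top-k logits.
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Divide by the maximum possible entropy so scores are comparable across k.
    return entropy / math.log(len(top))
```

A score near 0 means the model put almost all of its mass on one token; a score near 1 means the top-k candidates were close to equally likely. Any cutoff used to flag an answer as a likely hallucination would have to be tuned on labeled data.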

03 [Training] Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
Negative rollouts in GRPO carry no gradation of failure severity, and a few bad trajectories sampled from a combinatorially large failure space cover that space too thinly to penalize the policy well. Positive-only policy optimization sidesteps this by deriving implicit negative gradients from the positive batch alone, skipping negative rollout generation entirely. On reasoning benchmarks the method matches GRPO accuracy. Half the RL compute budget currently spent generating and scoring failed trajectories may be recoverable without any accuracy cost.
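
One way to see how a positive-only update can still push probability away from alternatives is the standard softmax identity: the gradient of log p(target) with respect to the logits is one-hot(target) minus p. The sketch below illustrates only that identity. Whether the paper derives its implicit negative gradients this way is not stated here, and the name `positive_only_grad` and the `advantage` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positive_only_grad(logits, target, advantage=1.0):
    """Gradient of advantage * log softmax(logits)[target] w.r.t. the logits.

    The target logit gets a positive gradient; every other logit gets a
    negative one, even though no negative sample was ever scored.
    """
    probs = softmax(logits)
    return [
        advantage * ((1.0 if i == target else 0.0) - p)
        for i, p in enumerate(probs)
    ]
```

Because the components sum to zero, raising the log-probability of a rewarded token necessarily lowers the logits of the unrewarded ones in proportion to their current probability.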

04 [Inference] Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
Ternary weights constrained to {-1, 0, +1} theoretically eliminate floating-point multiplication, but existing frameworks still treat them as dense float networks and forfeit that advantage entirely. Custom SIMD kernels built for ternary structure reclaim it, hitting 4-8 tokens per second on a standard laptop CPU with no discrete GPU. That puts local LLM inference within reach of the billion personal computers that will never touch datacenter hardware. Worth watching for edge deployment pipelines where cloud API latency or cost is a hard constraint.
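
To make the "no multiplication" point concrete, here is a scalar reference sketch of a ternary dot product and matrix-vector product. It is not the Litespark kernel: a real SIMD kernel would pack the weights into bits and process many lanes at once, and the function names `ternary_dot` and `ternary_matvec` are made up for this example.

```python
def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, +1}, using only add and subtract."""
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x      # +1 weight: add the activation
        elif w == -1:
            acc -= x      # -1 weight: subtract the activation
        elif w != 0:
            raise ValueError("weights must be -1, 0, or +1")
        # 0 weight: the activation is skipped entirely
    return acc


def ternary_matvec(matrix, activations):
    """Apply ternary_dot to each row of a ternary weight matrix."""
    return [ternary_dot(row, activations) for row in matrix]
```

For example, `ternary_dot([1, 0, -1, 1], [2.0, 5.0, 3.0, 0.5])` adds 2.0, skips 5.0, subtracts 3.0, and adds 0.5, with no multiply in the loop.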

05 [Agent] Milestone-Guided Policy Learning for Long-Horizon Language Agents
Credit misattribution is the core reason agent RL collapses on tasks longer than roughly 20 steps: correct early actions get penalized whenever the trajectory ends in terminal failure. BEACON, the proposed framework, addresses this by partitioning trajectories at learned milestone checkpoints and assigning credit locally within each segment rather than globally at the end. The framework cuts wasted trajectories by roughly half on long-horizon benchmarks. Teams fine-tuning agents on multi-step tasks with sparse terminal rewards should treat milestone scaffolding as a prerequisite, not an optional add-on.
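
The difference between global and segment-local credit can be shown with a toy sketch. Everything specific here is an assumption for illustration: milestones are passed in as known step indices (BEACON learns them), each reached milestone pays a fixed `milestone_reward`, and the function name `segment_credit` is hypothetical.

```python
def segment_credit(num_steps, milestone_steps, terminal_reward, milestone_reward=1.0):
    """Per-step credit with local assignment at milestone boundaries.

    Steps in a segment that ends at a reached milestone share that
    milestone's reward; steps after the last milestone receive the
    terminal reward. With no milestones this reduces to global credit.
    """
    credit = []
    start = 0
    for end in sorted(milestone_steps):
        if not start <= end < num_steps:
            raise ValueError("milestone steps must be distinct and within the trajectory")
        credit.extend([milestone_reward] * (end - start + 1))
        start = end + 1
    # Remaining steps after the final milestone are judged by the outcome.
    credit.extend([terminal_reward] * (num_steps - start))
    return credit
```

With six steps, milestones reached at steps 1 and 3, and a failed ending, only the last two steps absorb the failure; under global credit all six would.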

06 [Eval] Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
AI writing tools can raise individual output quality scores while shrinking idea diversity across a population by 30-40%, a crowding effect that per-user evals are structurally blind to. The framework models ideas as congestible goods and benchmarks AI-induced homogenization against matched unaided human baselines, requiring no human-AI interaction data to run. Standard quality metrics at the individual level will show improvement even as the population converges on the same outputs. Any team deploying generative tools at scale should run population-level diversity checks alongside per-output quality scores.
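
A population-level diversity check can be as simple as comparing average pairwise dissimilarity between an assisted group and an unaided baseline. The sketch below is a deliberately crude stand-in, not the paper's congestible-goods framework: it uses Jaccard distance over lowercase word sets, and the names `population_diversity` and `diversity_drop` are invented for this example.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the word-set overlap between two texts (0 = identical sets)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def population_diversity(ideas):
    """Mean pairwise Jaccard distance across all ideas in a population."""
    pairs = list(combinations(ideas, 2))
    if not pairs:
        raise ValueError("need at least two ideas")
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def diversity_drop(baseline_ideas, assisted_ideas):
    """Fractional loss of diversity relative to the unaided baseline."""
    base = population_diversity(baseline_ideas)
    if base == 0.0:
        raise ValueError("baseline has no diversity to compare against")
    return (base - population_diversity(assisted_ideas)) / base
```

A per-output quality score never sees this number, because it is defined only over the whole set of outputs. In practice an embedding-based distance would likely be a better choice than word overlap.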