Brief
AI research papers, explained for builders.
- Worth Reading
Strip the Leakage, and the LLM Forecasting Edge Mostly Disappears
A 36-month leakage-controlled test shows a 7B RAG forecaster's median IC of +0.154 is largely explained by macro-analog retrieval, not LLM capability.
- Also Worth Noting · 5 notes
- 02 MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
- 03 Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning
- 04 BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
- 05 Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards
- 06 SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery
- Worth Reading
Robots That Play First Solve Tasks Better: 20-Point Gains Without Extra Instructions
Self-directed robot play before task assignment builds a reusable skill library that lifts downstream performance by up to 20.6 points, no finetuning required.
- Also Worth Noting · 5 notes
- 02 ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?
- 03 Thinking with Visual Grounding
- 04 Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents
- 05 FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines
- 06 Understanding the Behaviors of Environment-aware Information Retrieval
- Worth Reading
ContextRL Trains Models to Find the One Sentence That Actually Matters
A new auxiliary RL objective forces LLMs to select the context fragment that supports an answer, yielding +2.2% on long-horizon agent benchmarks and +1.8% on visual QA.
- Also Worth Noting · 5 notes
- 02 S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence
- 03 Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation
- 04 LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
- 05 LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI
- 06 HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining
- Worth Reading
SAE Feature Clamping Gets a 95.8% Bypass Rate
Clamping SAE features suppresses one path to harmful behavior, not the behavior itself. Models recover through the unexplained reconstruction residual.
- Also Worth Noting · 5 notes
- 02 STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
- 03 EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
- 04 Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
- 05 CEO-Bench: Can Agents Play the Long Game?
- 06 Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems
- Worth Reading
The Field's Go-To GUI Agent Dataset Actively Breaks Fine-Tuning
ProCUA-SFT shows AgentNet causes negative transfer in CUA fine-tuning, while 3.1M synthetic steps lift OSWorld from 26.3% to 45.0%.
- Also Worth Noting · 5 notes
- 02 Learning from the Self-future: On-policy Self-distillation for dLLMs
- 03 The Price of Anarchy in Disaggregated Inference
- 04 A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization
- 05 FastContext: Training Efficient Repository Explorer for Coding Agents
- 06 Rethinking the Role of Efficient Attention in Hybrid Architectures
- Worth Reading
Same Success Rate, Completely Different Failure Modes: Web Agent Eval Is Broken
WebStep's 1,800-task benchmark reveals that agents scoring identically on task success diverge sharply on where and how they fail mid-workflow.
- Also Worth Noting · 5 notes
- 02 Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving
- 03 Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
- 04 Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
- 05 Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
- 06 All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code
- Worth Reading
Expert Exam Scores Don't Predict Medical LLM Reliability Under Pressure
MedMisBench shows LLM accuracy drops from 71.1% to 38.0% when misleading context is injected, exposing a structural gap in medical AI evaluation.
- Also Worth Noting · 5 notes
- 02 APPO: Agentic Procedural Policy Optimization
- 03 The Hidden Power of Scaling Factor in LoRA Optimization
- 04 Skip a Layer or Loop It? Learning Program-of-Layers in LLMs
- 05 No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions
- 06 Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
- Worth Reading
Frozen Safety Monitors Break After Fine-Tuning, Not After Quantization
Activation monitors trained on base models degrade sharply after LoRA fine-tuning but survive quantization, exposing a silent gap in most production safety stacks.
- Also Worth Noting · 5 notes
- 02 Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations
- 03 Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning
- 04 Who Flips? Self- and Cross-Model Counterarguments Reveal Answer Instability in LLMs
- 05 PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity
- 06 Open-SWE-Traces: Advancing Dual-Mode Multilingual Distillation for Software Engineering Agents
- Worth Reading
EvoTrainer: Fixing the Training Harness While Tuning the Policy Is a False Economy
EvoTrainer co-evolves LLM policies and training harnesses simultaneously, matching or beating human-engineered RL baselines across math, code, and SWE tasks.
- Also Worth Noting · 5 notes
- 02 Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
- 03 DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
- 04 High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation
- 05 ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
- 06 Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models
- Worth Reading
Two Tokens Fix Hidden-State Recurrence: SWITCH Makes Latent Reasoning RL-Trainable
SWITCH adds discrete boundary tokens to latent chain-of-thought, making hidden-state recurrence compatible with standard on-policy RL and causally interpretable for the first time.
- Also Worth Noting · 5 notes
- 02 FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
- 03 Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
- 04 N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
- 05 Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
- 06 On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
- Worth Reading
MiniMax Sparse Attention Cuts Million-Token Compute by 28x Without Quality Loss
MSA delivers 28.4x attention compute reduction and 14.2x prefill speedup at 1M context on a 109B model, matching full GQA quality.
- Also Worth Noting · 5 notes
- 02 Redesign Mixture-of-Experts Routers with Manifold Power Iteration
- 03 ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
- 04 Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs
- 05 TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
- 06 On Subquadratic Architectures: From Applications to Principles
- Worth Reading
The Safety Tool That Became a Jailbreak: GCD's Hidden Attack Surface
Grammar-constrained decoding, used to enforce code validity, suppresses LLM refusal tokens and lifts jailbreak success rates by 30+ points.
- Also Worth Noting · 5 notes
- 02 Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
- 03 VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
- 04 Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling
- 05 Redesign Mixture-of-Experts Routers with Manifold Power Iteration
- 06 PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
- Worth Reading
CoT Fine-Tuning Quietly Destroys Long-Context Recall in Hybrid LLMs
Chain-of-thought SFT collapses Needle-In-A-Haystack retrieval in hybrid linear-attention models, and a training-free QK weight restore fixes it.
- Also Worth Noting · 5 notes
- 02 On the Geometry of On-Policy Distillation
- 03 Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense
- 04 FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
- 05 Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
- 06 i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models
- Worth Reading
PPO's Ratio Clipping Has a Blind Spot. DRPO Fixes It.
DRPO replaces ratio-clipping in LLM RL with a smooth divergence regularizer, stabilizing off-policy training where PPO and GRPO break down.
- Also Worth Noting · 5 notes
- 02 End-to-End Context Compression at Scale
- 03 The Distillation Game: Adaptive Attacks & Efficient Defenses
- 04 When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
- 05 ECI: Effective Contrastive Information to Evaluate Hard-Negatives
- 06 Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
- Worth Reading
On-Policy Distillation Breaks at the Prefix, Not the Token
Trajectory-Refined Distillation names the exact structural failure in on-policy distillation and fixes it at the trajectory level, not the token level.
- Also Worth Noting · 5 notes
- 02 The Consistency Illusion: How Multi-Agent Debate Hides Reasoning Misalignment
- 03 Scaffold Effects on GAIA: A Controlled Comparison
- 04 When Should Queries Be Decomposed? A Stage-Aware Study of Query Decomposition for Multi-Condition Retrieval
- 05 Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?
- 06 sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
- Worth Reading
MoE-to-Dense Conversion Beats Dense Pruning by 6.3 Points
A new framework converts trained Mixture-of-Experts models into standard dense networks, outperforming dense-to-dense pruning by 6.3 pp at matched parameter count.
- Also Worth Noting · 5 notes
- 02 When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
- 03 POISE: Position-Aware Undetectable Skill Injection on LLM Agents
- 04 Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses
- 05 Chiaroscuro Attention: Spending Compute in the Dark
- 06 Phase Marginalization for Patch-Grid Instability in Vision Transformers
- Worth Reading
The Part of Your LLM You Throw Away Is Quietly Corrupting Your Embeddings
EmbedFilter uses the unembedding matrix to remove high-frequency token bias from LLM embeddings, improving MTEB zero-shot performance while cutting index size.
- Also Worth Noting · 5 notes
- 02 The Cold-Start Safety Gap in LLM Agents
- 03 SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
- 04 The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models
- 05 GENEB: Why Genomic Models Are Hard to Compare
- 06 SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
- Worth Reading
Code2LoRA Matches Per-Repo Fine-Tuning at Zero Inference Token Cost
A hypernetwork generates repository-specific LoRA adapters on the fly, matching per-repo fine-tuning accuracy while adding zero inference-time token overhead.
- Also Worth Noting · 5 notes
- 02 Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
- 03 LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
- 04 BraveGuard: From Open-World Threats to Safer Computer-Use Agents
- 05 Complexity-Balanced Diffusion Splitting
- 06 Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study
- Worth Reading
AI's Deployment Gap Is an Evaluation Problem, Not a Capability Problem
ALE benchmarks AI agents on 1,000+ real economic workflows, where current top systems average a 2.6% full pass rate.
- Also Worth Noting · 5 notes
- 02 World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
- 03 Why Muon Outperforms Adam: A Curvature Perspective
- 04 SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
- 05 Flash-WAM: Modality-Aware Distillation for World Action Models
- 06 Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
- Worth Reading
NTP's One-Hot Supervision Leaves Representation Space Broken by Design
NITP adds a dense continuous supervision signal in latent space during pre-training, lifting MMLU-Pro by 5.7% on a 9B MoE model with only 2% extra training FLOPs.
- Also Worth Noting · 5 notes
- 02 Unified Neural Scaling Laws
- 03 Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill
- 04 Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
- 05 Large Language Models Hack Rewards, and Society
- 06 Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
- Worth Reading
Agentic Inference Is Structurally Wasteful: LayerRoute Fixes It in 6 Minutes
LayerRoute trains a 1.1M-parameter LoRA adapter that skips 15% of FLOPs on tool calls while barely touching planning steps, cutting agentic compute waste without retraining.
- Also Worth Noting · 5 notes
- 02 HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems
- 03 AdaCodec: A Predictive Visual Code for Video MLLMs
- 04 LLM Anonymization Against Agentic Re-Identification
- 05 Parametric Social Identity Injection and Diversification in Public Opinion Simulation
- 06 SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
- Worth Reading
On-Policy Distillation Without Logit Access: +28.64% on Math
OmniOPD removes the white-box teacher requirement from on-policy distillation, using chunk-level semantic verification to match or beat open-weight OPD baselines.
- Also Worth Noting · 5 notes
- 02 Trust Region On-Policy Distillation
- 03 SABER: Benchmarking Operational Safety of LLM Coding Agents in Stateful Project Workspaces
- 04 LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning
- 05 Neural Network Compression by Approximate Differential Equivalence
- 06 When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval
- Worth Reading
LoRA Has a Memory Ceiling, and Now You Can Calculate It
A new power law quantifies LoRA's exact parametric memory capacity, giving teams a principled ceiling instead of trial-and-error rank tuning.
- Also Worth Noting · 5 notes
- 02 When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
- 03 Send a SCOUT First: Pre-hoc Reasoning for Adaptive Detector Allocation in Prompt-Injection Defense
- 04 OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
- 05 Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
- 06 RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
- Worth Reading
Safety Benchmarks May Be Measuring Evaluation Awareness, Not Alignment
Models fine-tuned on texts describing evaluation practices score significantly safer on benchmarks without any change in actual deployment behavior.
- Also Worth Noting · 5 notes
- 02 Self-Improving Language Models with Bidirectional Evolutionary Search
- 03 How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
- 04 When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
- 05 Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
- 06 EarlyTom: Early Token Compression Completes Fast Video Understanding
- Worth Reading
Safety Benchmarks May Measure Test-Awareness, Not Alignment
Models fine-tuned on evaluation meta-knowledge score safer on six safety benchmarks, exposing a new confounder that existing detection methods can't catch.
- Also Worth Noting · 5 notes
- 02 Self-Improving Language Models with Bidirectional Evolutionary Search
- 03 D2-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
- 04 MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
- 05 Understanding Data Temporality Impact on Large Language Models Pre-training
- 06 Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
- Worth Reading
Agentic RL Training Actively Degrades Tool Judgment: A Fix in 18% Fewer Calls
AKBE fixes a training-loop failure where agentic RL makes models worse at knowing when NOT to call tools, recovering 25% higher tool productivity.
- Also Worth Noting · 5 notes
- 02 Your Embedding Model is SMARTer Than You Think
- 03 Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling
- 04 Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
- 05 MobileMoE: Scaling On-Device Mixture of Experts
- 06 Language Models Need Sleep
- Worth Reading
AI Research Agents Fabricate Citations at 21%: A Verifiability Crisis
ScientistOne's Chain-of-Evidence framework exposes systematic fabrication in autonomous research agents, achieving zero hallucinated references where baselines fail at rates up to 21%.
- Also Worth Noting · 5 notes
- 02 $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
- 03 Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
- 04 The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation
- 05 LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
- 06 Self-Improving CAD Generation Agents with Finite Element Analysis as Feedback
- Worth Reading
Silencing Agents Beats Letting Them Talk: DarkForest Cuts Errors 30.7%
DarkForest keeps multi-agent LLMs from seeing each other's reasoning, then aggregates structured beliefs, cutting errors 30.7% and token use 6.5x.
- Also Worth Noting · 5 notes
- 02 NITP: Next Implicit Token Prediction for LLM Pre-training
- 03 Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
- 04 Hide to Guide: Learning via Semantic Masking
- 05 Locality Matters for Training-Free Audio Token Compression in Audio-Language Models
- 06 MVR-cache: Optimizing Semantic Caching via Multi-Vector Retrieval and Learned Prompt Segmentation
- Worth Reading
Most Tokens in a Correct Response Are Getting the Wrong Credit Signal
DelTA shows RLVR's policy-gradient update acts as an implicit token-level discriminator, then fixes the distortion it creates, gaining 3.26 points on math benchmarks.
- Also Worth Noting · 5 notes
- 02 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
- 03 LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
- 04 FastKernels: Benchmarking GPU Kernel Generation in Production
- 05 TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
- 06 Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
- Worth Reading
RLVR Fine-Tuning Is Geometrically Wasteful: Rank-1 Extrapolation Matches Full Training
RELEX shows RLVR weight trajectories are rank-1 and near-linear, letting teams extrapolate full-run checkpoints from just 15% of training steps.
- Also Worth Noting · 5 notes
- 02 WorldKV: Efficient World Memory with World Retrieval and Compression
- 03 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
- 04 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
- 05 The Distillation Game: Adaptive Attacks & Efficient Defenses
- 06 It Takes Two: Complementary Self-Distillation for Contextual Integrity in LLMs
- Worth Reading
Video MLLMs Fake Audio Understanding: Visual Hallucination at Scale
Every major video MLLM, including models from OpenAI and Google, substitutes visual inference for actual audio processing, a flaw now measurable and partially fixable.
- Also Worth Noting · 5 notes
- 02 You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
- 03 Process Rewards with Learned Reliability
- 04 PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
- 05 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
- 06 optimize_anything: A Universal API for Optimizing any Text Parameter
- Worth Reading
Post-Trained MoE Models Can Skip Half Their Experts Without Retraining
ZEDA converts static MoE models into dynamic ones via self-distillation, cutting over 50% of expert FLOPs with ~1.20× speedup and minimal accuracy loss.
- Also Worth Noting · 5 notes
- 02 Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
- 03 CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
- 04 PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
- 05 Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs
- 06 optimize_anything: A Universal API for Optimizing any Text Parameter
- Worth Reading
Post-Trained MoE Can Skip Half Its Experts Without Retraining
ZEDA converts static MoE models into dynamic ones via self-distillation, cutting over 50% of expert FLOPs with minimal accuracy loss and no pretraining.
- Also Worth Noting · 5 notes
- 02 Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding
- 03 CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
- 04 Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
- 05 Auditing Agent Harness Safety
- 06 Hölder Policy Optimisation
- Worth Reading
Frontier Research Agents Pass at Under 22% on Consulting-Grade Work
A new benchmark with verifiable rubrics and cognitive traps reveals frontier deep research agents fail decision-grade consulting tasks at alarming rates.
- Also Worth Noting · 5 notes
- 02 Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents
- 03 VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
- 04 Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
- 05 Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization
- 06 LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
- Worth Reading
Dense Teacher Supervision Breaks Multi-Turn Agents. SDAR Fixes It.
SDAR adds a sigmoid-gated distillation layer on top of RL, lifting agent performance by up to 10.2% over GRPO across ALFWorld, WebShop, and Search-QA.
- Also Worth Noting · 5 notes
- 02 HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
- 03 Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
- 04 WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
- 05 Orchard: An Open-Source Agentic Modeling Framework
- 06 EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
- Worth Reading
Agents Score 55% on Belief Invalidation: The Silent Memory-Rot Problem
STALE benchmark reveals frontier LLMs fail to detect implicit memory conflicts, scoring only 55.2% on belief invalidation across 1,200 queries.
- Also Worth Noting · 5 notes
- 02 IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
- 03 LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
- 04 Long Context Pre-Training with Lighthouse Attention
- 05 RouteProfile: Elucidating the Design Space of LLM Profiles for Routing
- 06 FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
- Worth Reading
One Base Model, One Million Policies: MinT's LoRA Adapter Architecture
MinT keeps a single resident base model and hot-swaps LoRA adapters at request time, cutting per-policy GPU cost to near-zero at million-adapter scale.
- Also Worth Noting · 5 notes
- 02 KL for a KL: On-Policy Distillation with Control Variate Baseline
- 03 AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
- 04 FrameSkip: Learning from Fewer but More Informative Frames in VLA Training
- 05 Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
- 06 FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
- Worth Reading
On-Policy Distillation Hurts When the Teacher's Context Is Wrong
A training-free diagnostic shows distillation guidance aligns with ideal gradients on incorrect rollouts but degrades on correct ones, breaking the 'distill from best model' default.
- Also Worth Noting · 5 notes
- 02 SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
- 03 Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
- 04 Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
- 05 Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
- 06 Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
- Worth Reading
MatryoshkaLoRA: One Training Run, Every Rank You Need
A nested diagonal matrix inside LoRA adapters eliminates rank grid search, yielding multiple efficiency-accuracy operating points from a single fine-tuning run.
- Also Worth Noting · 5 notes
- 02 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
- 03 Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training
- 04 Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts
- 05 ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?
- 06 ModelLens: Finding the Best for Your Task from Myriads of Models
- Worth Reading
LEAD Cuts Chain-of-Thought Length Without Accuracy Loss
LEAD uses adaptive RL reward shaping to eliminate CoT padding, achieving top accuracy and efficiency scores across five math benchmarks.
- Also Worth Noting · 5 notes
- 02 LLM Agents Already Know When to Call Tools -- Even Without Reasoning
- 03 Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
- 04 LEVI: Stronger Search Architectures Can Substitute for Larger LLMs in Evolutionary Search
- 05 On-Policy Distillation with Best-of-N Teacher Rollout Selection
- 06 Crosslingual On-Policy Self-Distillation for Multilingual Reasoning
- Worth Reading
SkillOS: The Curation Bottleneck That Keeps LLM Agents Stuck at Zero
SkillOS uses RL to train a dedicated skill curator that filters and evolves reusable agent experience, beating memory-based baselines on multi-turn tasks.
- Also Worth Noting · 5 notes
- 02 MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
- 03 Continuous-Time Distribution Matching for Few-Shot Diffusion Distillation
- 04 StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
- 05 MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval
- 06 Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning
- Worth Reading
Every LLM Throws Away Token Identity After Layer One. TIDE Doesn't.
TIDE re-injects token identity at every transformer layer, directly fixing rare-token undertraining and contextual collapse in small models.
- Also Worth Noting · 5 notes
- 02 Continuous Latent Diffusion Language Model
- 03 UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
- 04 A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
- 05 Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
- 06 Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
- Worth Reading
MoE Experts Are Entangled by Default. EMO Fixes That at 1B Scale.
EMO pretrains MoE models so expert subsets specialize by domain, cutting 75% of experts at inference while losing only 1% accuracy.
- Also Worth Noting · 5 notes
- 02 The First Token Knows: Single-Decode Confidence for Hallucination Detection
- 03 Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
- 04 Litespark Inference on Consumer CPUs: Custom SIMD Kernels for Ternary Neural Networks
- 05 Milestone-Guided Policy Learning for Long-Horizon Language Agents
- 06 Ex Ante Evaluation of AI-Induced Idea Diversity Collapse
- Worth Reading
SFT Before RL Is Actively Hurting Your Multimodal Model
PRISM inserts a black-box on-policy distillation stage between SFT and RLVR, closing distributional drift without white-box model access and lifting accuracy by up to +6 points.
- Also Worth Noting · 5 notes
- 02 Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers
- 03 ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis
- 04 On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning
- 05 TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
- 06 Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs
- Worth Reading
GUI Agents Top Out at 21% on Multi-App Tasks That Mirror Real Work
WindowsWorld exposes a hard ceiling in current GUI agent evals: agents that look capable on single-app benchmarks collapse on professional cross-application workflows.
- Also Worth Noting · 5 notes
- 02 Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
- 03 Counting as a minimal probe of language model reliability
- 04 Linear-Time Global Visual Modeling without Explicit Attention
- 05 Motion-Aware Caching for Efficient Autoregressive Video Generation
- 06 Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark
- Worth Reading
The Most Engineered Agentic Gateway Scores 0.000 on Every Safety Check
A new audit framework exposes four structural failure modes in agentic-AI runtimes, with the leading open-source gateway scoring zero recall on all four.
- Also Worth Noting · 5 notes
- 02 Retrieval with Multiple Query Vectors through Anomalous Pattern Detection
- 03 Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
- 04 Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
- 05 NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
- 06 ShiftLIF: Efficient Multi-Level Spiking Neurons with Power-of-Two Quantization
- Worth Reading
A Single Tool Call Can Poison an Agent's Memory for 100+ Sessions
Trojan Hippo achieves 85-100% attack success against frontier models by planting dormant payloads in agent memory via one untrusted tool call.
- Also Worth Noting · 5 notes
- 02 Training Non-Differentiable Networks via Optimal Transport
- 03 Stochastic Sparse Attention for Memory-Bound Inference
- 04 Model Spec Midtraining: Improving How Alignment Training Generalizes
- 05 What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models
- 06 RefusalGuard: Geometry-Preserving Fine-Tuning for Safety in LLMs
- Worth Reading
Cross-Architecture dLLM Distillation: 0.6B Student, 48.78 HumanEval
TIDE is the first framework to distill diffusion LLMs across incompatible architectures, lifting a 0.6B student to 48.78 HumanEval against a 32.3 AR baseline.
- Also Worth Noting · 3 notes
- 02 Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
- 03 Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
- 04 Efficient Training on Multiple Consumer GPUs with RoundPipe
- Worth Reading
Your LangGraph Orchestrator Is Failing 24% of Travel Conversations
A controlled comparison shows in-context self-orchestration beats LangGraph on procedural tasks, cutting failure rates by half across three domains.
- Also Worth Noting · 4 notes
- 02Large Language Models Explore by Latent Distilling
- 03MCPHunt: An Evaluation Framework for Cross-Boundary Data Propagation in Multi-Server MCP Agents
- 04Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
- 05HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats
- Worth Reading
Safety Signal Lives Inside the Model, Not Just at the End
SIREN probes internal LLM layers to detect harmful content, beating current guard models with 250x fewer trainable parameters.
- Also Worth Noting · 5 notes
- 02Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
- 03Generating Place-Based Compromises Between Two Points of View
- 04Learning Evidence Highlighting for Frozen LLMs
- 05Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
- 06Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
- Worth Reading
On-Policy Distillation Makes Models More Accurate and More Overconfident — Simultaneously
Standard training methods boost accuracy but inadvertently make models overconfident, because the model learns from information that is unavailable at deployment. A new approach separates accuracy training from confidence calibration, keeping the accuracy gains while repairing the confidence estimates that downstream systems such as AI agents and retrieval pipelines depend on.
- Also Worth Noting · 5 notes
- 02Terminal Wrench: Dataset of AI Reward Hacks
- 03Token-Efficient Agent for Long-Term Memory
- 04OneVL: One-Step Latent Reasoning for Autonomous Driving
- 05Scalable Multi-Agent Multi-View World Models
- 06Symbolic Guardrails for Safer AI Agent Actions
- Worth Reading
Learning Adaptive Reasoning Paths for Efficient Visual Reasoning
Visual reasoning models now route simple questions directly to answers instead of forcing every query through lengthy multi-step reasoning chains. This adaptive approach matches accuracy while slashing compute costs on the majority of real-world questions—which turn out to be straightforward factual or perceptual tasks, not complex reasoning problems.
- Also Worth Noting · 5 notes
- 02Benchmark for LLM Algorithmic Trading Strategies
- 03NTIRE 2026 Video Saliency Prediction Challenge
- 04AI architecture evolution mirrors biology's statistical patterns
- 05It's all about the angle: Your photos, re-composed
- 06Advanced VLM Runs on Compact Edge AI Device
- Worth Reading
Agents Can Now Learn From Their Own Past Reasoning, Without Retraining
AI agents can now learn from their own past reasoning without retraining by retrieving similar thinking patterns from a stored bank of completed tasks. This lets agents apply proven problem-solving strategies to new work immediately, improving success rates especially when tasks share logical structure but differ in wording, which is where traditional memory systems fail. For production teams, the implication is that building a reasoning trace archive from day one can become more valuable than the model itself.
- Also Worth Noting · 5 notes
- 02Modernizing Facebook Groups Search
- 03 3 new ways Ads Advisor is making Google Ads safer and faster
- 04QIMMA: Rigorously Ranking Arabic Language Models by Quality
- 05Grounding Korean AI Agents in Real Demographics
- 06OpenAI Scales Codex to Enterprises Globally
- Worth Reading
Deep Research Agents Fail on the Basics — and Current Benchmarks Can't See It
Research agents that write full reports are being tested on the live web—where results change daily and can't be repeated. A new evaluation framework replaces this chaos with frozen, realistic document collections for each task, finally making performance scores trustworthy and exposing hidden failure modes like poor citations or missing facts that single scores hide.
- Also Worth Noting · 5 notes
- 02Value Gradient Flow for Stable RL
- 03Easy Knowledge Transfer Between Different AI Models
- 04LongAct: Internal Signals for Long-Context RL in LLMs
- 05GlobalSplat: Efficient 3D Gaussian Splatting with Global Tokens
- 06Teacher-Student Fine-Tunes Reasoning Models Better
- Worth Reading
Reward models that explain themselves outperform those that just score
Reward models for image generation now produce detailed written critiques instead of just numerical scores, enabling them to guide image refinement rather than merely rank outputs. This makes AI image generators improvable without retraining—a critique-based feedback loop instantly produces better images by revising prompts based on the model's own reasoning about what went wrong.
- Also Worth Noting · 5 notes
- 02GameWorld: Standardized Evaluation for Multimodal Game Agents
- 03Continuous Diffusion AI Now Matches Traditional Language Models
- 04Pinpointing Key Tokens for Efficient AI Model Training
- 05Target Policy Optimization Decouples RL Updates
- 06Seedance 2.0: Unified Multi-Modal Audio-Video Generation
- Worth Reading
Computer-Use Agents Fail Safety Tests Even When Users Do Everything Right
AI agents that control computers now face a hidden danger: they fail safety tests even when users give completely harmless instructions. The problem emerges from what happens during task execution—malicious files the agent encounters or unintended side effects it causes—not from the original request. Over 90% of leading systems fail these tests, revealing that checking user inputs alone cannot catch these downstream harms.
- Also Worth Noting · 5 notes
- 02New Tokenization Helps AI Understand SVG Geometry
- 03Spec Kit Agents for Context-Aware AI Coding
- 04AI Learns to Play Challenging Pokemon Red
- 05Rethinking Diffusion Models from a Langevin Perspective
- 06NVIDIA's Nemotron OCR V2: Fast, Accurate Multilingual OCR
- Worth Reading
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Video generation typically costs vastly more compute than understanding, making language-model-first architectures inefficient. Uni-ViGU flips this: starting with a video diffusion model as the foundation, it adds understanding as a lightweight add-on, letting both tasks share the generator's rich visual knowledge without architectural strain. This changes how teams can build unified systems for video captioning, QA, and generation.
- Also Worth Noting · 5 notes
- 02Audio-Omni: Single Model for All Audio Tasks
- 03ATANT: AI Continuity Evaluation Framework
- 04TorchUMM: A Unified Platform for Multimodal AI Models
- 05SciPredict: LLMs Predict Scientific Experiment Outcomes
- 06IceCache: Memory-efficient KV-Cache for Long LLM Sequences
- Worth Reading
The 3D point cloud field has a reproducibility problem, and it's structural
LIDARLearn unifies 55+ competing point cloud models—used in autonomous vehicles, drones, and robots—into a single standardized testing framework, finally making fair performance comparisons possible. Until now, incompatible codebases and preprocessing pipelines masked whether one method truly outperformed another, forcing teams to re-test every algorithm on their own data before deployment.
- Also Worth Noting · 5 notes
- 02Preventing Lost Work for AI Coding Agents
- 03TorchUMM: Unified Codebase for Multimodal Models
- 04LLM Agents Autonomously Improve Models with RL Post-Training
- 05Optimized Generative Image Compression with RDVQ
- 06Coordinating Semantic IDs for Better Short-Video Search
- Worth Reading
The 3D registration benchmark problem nobody fixed: models trained on perfect data, tested on perfect data
Factory robots and inspection systems fail with standard 3D registration models because they're trained on perfect synthetic data—clean scans with no noise or occlusion. R3PM-Net introduces the first real industrial datasets (Sioux-Cranfield and Sioux-Scans) to measure what actually works in production, finally closing the gap between lab benchmarks and grimy factory floors.
- Also Worth Noting · 5 notes
- 02AI Agent Reasoning Collapses Undetected
- 03AI Paints Step-by-Step Like Humans
- 04Agentic LLMs Reason Using Complex Graph Structures
- 05MARS enables multi-token generation for AR models
- 06Cheaper, Better AI Images with Scaled RL Training
- Worth Reading
Training 100B+ Models Without a Cluster: Memory Architecture Beats Hardware Scale
MegaTrain trains 100-billion-parameter models on a single GPU by storing weights in CPU memory and streaming them to the GPU layer-by-layer, eliminating the need for expensive multi-GPU clusters. Smaller research teams and organizations can now experiment with massive models on standard workstations instead of renting cloud computing time.
- Also Worth Noting · 5 notes
- 02DISCO AI Designs Enzymes, Catalytic Sites Included
- 03Benchmarking Realistic LLM Agent Skill Usage
- 04Optimizing Retrieval for AI Agents
- 05Why Model Pruning Fails Generative AI
- 06Video-MME-v2: Next-Gen Benchmark for Real Video Understanding
- Worth Reading
Agents Keep Relearning the Same Lessons: SkillX Builds a Shared Curriculum Instead
SkillX lets AI agents share what they learn across teams instead of each solving problems from scratch. By organizing experience into three levels—strategic plans, functional skills, and atomic moves—agents can instantly access and reuse the right knowledge for any task, eliminating wasteful redundant learning.
- Also Worth Noting · 5 notes
- 02ClawArena: Benchmarking AI Agents in Evolving Information Environments
- 03AURA: Always-On Real-Time Video Assistance
- 04TriAttention Compresses KV Cache for Long LLM Reasoning
- 05Vero: Open RL Recipe for General Visual Reasoning
- 06MinerU2.5-Pro: Boosting Document Parsing with Data Engineering
- Worth Reading
When the teacher cheats, the student memorizes instead of learns
When AI models learn from a "teacher" that sees the right answer, they memorize shortcuts instead of learning generalizable skills—a hidden failure of dense supervision. This work shows that mixing in sparse feedback from verifiable outcomes prevents this collapse, enabling stable long-term training where pure dense signals fail.
- Also Worth Noting · 5 notes
- 02InCoder-32B Generates Expert Reasoning for Industrial Code
- 03AI Builds Shared Spatial Understanding Via Dialogue
- 04Simple Sliding Window Beats Complex AI for Video Understanding
- 05Salt: Fast, Sharp Video Generation at Low Computational Cost
- 06Evaluating Agentic AI's Multimodal Tool Use
- Worth Reading
AI-generated videos are too consistent — and that's exactly how to catch them
AI-generated videos betray themselves through eerie consistency—their frames correlate with each other far too predictably because they're anchored to a fixed prompt, while real video naturally accumulates random camera shake and lighting flicker. A new detection method exploits this temporal fingerprint across entire videos rather than hunting for glitches in individual frames, making it resilient to improvements in generation quality.
- Also Worth Noting · 5 notes
- 02Diagonal-Tiled Attention Boosts LLM Efficiency
- 03ClawArena: Benchmarking AI in Dynamic Information
- 04Analyzing Noisy Labels for LLM Reasoning
- 05Subspace Control for Constrained LLM Steering
- 06TORA: Topological Alignment for 3D Shape Assembly
- Worth Reading
How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines
AI coding assistants fail on large real-world codebases not because they're weak reasoners, but because they lack the organizational knowledge experienced engineers carry: which modules own what logic, which dependencies matter, which files are safe to touch. Meta fixed this by explicitly mapping its 4,100-file codebase's ownership structure and cross-repo dependencies, dramatically improving how often agents made useful edits. For any team seeing plausible-but-wrong code suggestions, the fix isn't a better model; it's encoding your codebase's actual structure.
- Also Worth Noting · 2 notes
- 02OpenAI Launches Pilot Fellowship for AI Safety Research Talent
- 03People-First Industrial Policy for the AI Age
- Worth Reading
Open-ended discovery systems are not truly open-ended; CORAL is the first framework to make them autonomous
Most AI discovery systems claim to explore freely, but hidden rules actually pre-decide their every move. CORAL removes these constraints, letting multiple AI agents work together through shared memory and recover from dead ends on their own—enabling genuine open-ended exploration for math, algorithms, and complex problems that require sustained multi-step searching.
- Also Worth Noting · 5 notes
- 02Investigating Autonomous Agents' Real-World Code Contributions
- 03Steering AI's Focus on Specific Image Details
- 04SKILL0: AI Agents Learn and Internalize Skills Deeply
- 05NearID: Separating Identity from Background in AI Vision
- 06T5Gemma-TTS Boosts Voice Cloning for Long Speech
- Worth Reading
LoRA Isn't the Default for Hybrid Models Anymore
For hybrid models combining recurrence and attention, tuning the recurrent layer's initial hidden state outperforms the standard LoRA approach by 10–24 percentage points on code tasks, using zero extra parameters and requiring no weight merging at deployment. This method works for narrow, data-scarce problems on models like Qwen and Falcon but doesn't apply to standard transformers or transfer to text-to-SQL tasks.
- Also Worth Noting · 5 notes
- 02ClawKeeper: Comprehensive Safety for OpenClaw Agents
- 03Terminal Agents Sufficient for Enterprise Automation
- 04Reasoning Shift: How Context Silently Shortens LLM Reasoning
- 05ViGoR-Bench: Generative AI Lacks Logical Reasoning
- 06AgentWatcher: Rule-Based Prompt Injection Defense
- Worth Reading
Treating vision and audio as second-class citizens has a cost
A new unified model treats vision, audio, and text as equal citizens by converting them all into discrete tokens, eliminating the separate encoders and translation layers that plague current multimodal systems. This architectural shift reduces error-compounding integration seams, making it dramatically simpler to build AI systems that truly integrate speech and images rather than bolting them onto language models as afterthoughts.
- Also Worth Noting · 5 notes
- 02 1000+ Medical Imaging Datasets for Foundation Models
- 03LLMs Predict Alzheimer's with Interpretable Tabular Data
- 04RAG Quantifies Transplant Guide Discrepancies
- 05TokenDial: Continuous Video Attribute Control
- 06Framework to Align LLM Behavior with Principles
- Worth Reading
AI agents read physics papers but do not reproduce them.
AI agents claiming strong coding skills fail to reproduce physics papers end-to-end, exposing a critical gap between understanding research and executing it. A new benchmark tested 30 physics tasks requiring agents to read papers and match published results—revealing that standard coding benchmarks mask blind spots crucial for automating real scientific work.
- Also Worth Noting · 5 notes
- 02ImagenWorld: Explaining Image Generation Model Failures
- 03Dynamic MoE prevents forgetting in vision language models
- 04Emergent Social Risks in Multi-Agent AI Systems
- 05STRIDE: Deciding When and What to Respond in Live Video
- 06AI Coder Unifies Specialized Expertise
- Worth Reading
Training in the Deployment Harness Closes the Benchmark-Production Gap
Cursor's Composer 2 trains coding models inside the actual deployment environment rather than on isolated benchmarks, eliminating the usual gap between test performance and real-world results. By running reinforcement learning with the same tools and structure that deployed users see, the model learns to solve problems that actually matter instead of optimizing for curated datasets, a fundamental shift in how coding AI gets trained.
- Also Worth Noting · 5 notes
- 02Auditable AI for Full Medical Image Studies
- 03Untrustworthy AI Explanations in Chain-of-Thought
- 04Training Self-Driving AI for Rare Road Scenarios
- 05Holo3: Holographic Interface Breaks Computer Use Frontier
- 06We’re creating a new satellite imagery map to help protect Brazil’s forests.
- Worth Reading
Diffusion policy RL has a hidden unification problem — and it's slowing everyone down
Robot learning teams have been reinventing the same solutions repeatedly because no one agreed on how diffusion-based robot policies actually work. FlowRL maps all existing approaches onto a unified framework, revealing which innovations are genuinely new versus cosmetic variations—letting researchers skip redundant work and compare fairly for the first time.
- Also Worth Noting · 5 notes
- 02KVSculpt: Distilling KV Cache for Efficient LLM Inference
- 03SkyNet: MuZero for Uncertain Multi-Player Games
- 04Quantizing Memory for Longer AI Video Generation
- 05Binary Latent Protein Optimization
- 06RSR-core: Faster Low-Bit Matrix-Vector Multiplication
- Worth Reading
Cost volumes are stereo matching's sacred cow — warping alone just dethroned them
A new stereo vision method ditches the industry standard "cost volumes"—3D grids that compare pixels across image pairs—and instead uses iterative image warping to measure and fix misalignment directly. It's now the fastest and most accurate method on all three major benchmarks simultaneously, running 1.8–6.7x faster while cutting cross-domain error by 81%, making it immediately valuable for depth-sensing applications from robotics to autonomous vehicles.
- Also Worth Noting · 5 notes
- 02RL Helps LMs Reason with Multiple Answers
- 03Autonomous Agents Enhance Evolutionary Search Operations
- 04Calibri: Boosting Diffusion Transformers with One Simple Parameter
- 05AVControl: Efficient Audio-Visual Control Framework
- 06SlopCodeBench: Iterative Code Quality Benchmarking for AI
- Worth Reading
Video Agents That Decide What to Watch Before Watching It
EVA lets video AI systems decide which frames matter before processing them, using reinforcement learning to develop adaptive viewing strategies instead of watching everything uniformly. This cuts wasted computation on long videos—a critical bottleneck for any team building video understanding systems at scale.
- Also Worth Noting · 5 notes
- 02Multimodal AI Models Judge Themselves for Better Reasoning
- 03Measuring Physical Frame Rate in AI Videos
- 04CUA-Suite: Massive Video Demonstrations for Computer-Use Agents
- 05Self-Distillation Can Harm LLM Math Reasoning
- 06UI-Voyager: Self-Evolving Agent Learns from Failed Mobile Tasks
- Worth Reading
Long video QA breaks when models ignore what the video is already telling them
Most video QA systems fail on long videos because they match query words to segments in isolation, ignoring how scenes connect visually and temporally. VideoDetective treats the video as a graph where segments influence each other's relevance scores, letting it find clues that only make sense in context—fixing a fundamental flaw in how we retrieve answers from hours of footage.
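The idea of segments influencing each other's relevance can be illustrated with a simple score-propagation loop. This is a minimal sketch under stated assumptions, not VideoDetective's actual method: the function name `propagate_relevance`, the `alpha` mixing factor, and the affinity values are all hypothetical, and affinities are assumed to lie in [0, 1] with each segment's total affinity at most 1 so scores stay bounded.

```python
def propagate_relevance(initial, edges, alpha=0.5, iterations=10):
    """Diffuse per-segment relevance scores over a segment graph.

    initial: query-match scores in [0, 1], one per video segment
    edges:   dict mapping (i, j) segment pairs to an affinity weight,
             treated as undirected (assumed: per-segment total <= 1)
    alpha:   how much a segment listens to its neighbours (assumed)
    """
    n = len(initial)
    neighbours = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    scores = list(initial)
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Affinity-weighted relevance flowing in from linked segments.
            incoming = sum(w * scores[j] for j, w in neighbours[i])
            updated.append((1 - alpha) * initial[i] + alpha * incoming)
        scores = updated
    return scores


# Segment 1 matches the query directly. Segment 2 has no direct match but is
# strongly linked to segment 1; segment 0 is only weakly linked to it.
initial = [0.0, 1.0, 0.0, 0.0]
edges = {(0, 1): 0.1, (1, 2): 0.9, (2, 3): 0.1}
scores = propagate_relevance(initial, edges)
```

In this toy graph, segment 2 ends up ranked above segment 0 even though neither matched the query in isolation, which is the kind of context-dependent clue the blurb describes.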
- Worth Reading
Deep research agents do not need the internet; they need the right offline corpus
- Worth Reading
DoRA's memory wall breaks at high rank: a systems fix, not a math fix
- Also Worth Noting · 3 notes
- 04Omni-WorldBench: Comprehensive Interaction Evaluation for World Models
- 05Efficient VLM processing by focusing on high-resolution image crops
- 06Unified AI Model Generates Realistic Synchronized Human Video and Audio
- Worth Reading
3D reasoning failures in VLMs stem from perception issues, not language processing
Vision-language models struggle with 3D spatial reasoning because they lack training signal, not because they need richer input data. This work trains models to reconstruct scenes and understand their own position within them, enabling video-based AI systems and AR applications to reason about space without preprocessing geometric data at inference time.
- Also Worth Noting · 2 notes
- 02Training Remote Sensing VLMs with OpenStreetMap
- 03PARSA-Bench: First Persian Audio-Language AI Benchmark
- Worth Reading
Real websites will get your agent banned — synthetic clones will get it trained
VeriEnv lets AI agents train on synthetic website clones instead of real sites, eliminating bot detection blocks and unreliable LLM judges. Agents now get deterministic feedback by reading internal site state, making web automation training 10x safer and faster—perfect for companies building search tools and automation pipelines before deploying to production.
- Worth Reading
The Search Agent Data Gap Has a Structural Fix — and the Numbers Behind It Are Now Public
- Worth Reading
Residual connections assume every layer matters equally — these results say they're wrong by design
- Also Worth Noting · 6 notes
- 04New Attention Trick Stops Deep AI Models From Forgetting Early Insights
- 05New Benchmark Tests AI Agents on Real Enterprise Workflows
- 06LLMs That Optimize Themselves Using Feedback and Rewards
- 07Two Rival AIs That Force Each Other to Write Better Code
- 08Teaching AI to Judge Which Research Ideas Are Worth Pursuing
- 09Benchmark Tests AI Agents on Evolving, Real-World Codebases
- Worth Reading
Most researchers are using AI wrong — here's the five-level map that shows why
For the first time, we have a clear map for where AI-assisted research actually sits—from asking ChatGPT questions to running fully autonomous agents overnight. The key insight: most teams lack guardrails to stop agents from reporting plausible-looking false results, turning verification itself into the critical failure point that needs explicit rules built into the agent's instructions.
- Worth Reading
Coding Agents Fail at Real-World Optimization—and Current Benchmarks Can't Even See It
- Also Worth Noting · 3 notes
- 03Attention Heads That Reach Back to Earlier Layers
- 04Teaching AI to Actually Use Unfamiliar Code Libraries
- 05CT scan AI bottleneck fixed with smarter retrieval method
- Worth Reading
Ensemble weighting that punishes disagreement outperforms static mixing in non-stationary sequential tasks
For ensemble models in shifting environments, a new weighting system tracks both individual performance and how much each model agrees with the others—penalizing those that drift from consensus. This catches failing specialists before their raw accuracy numbers do, and comes with formal guarantees that the approach won't fall too far behind an ideal fixed strategy, even as the optimal expert changes over time.
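One way to picture a weighting rule that tracks both individual performance and agreement is a multiplicative-weights update with an added disagreement penalty. The sketch below is illustrative only and is not the paper's algorithm: `update_weights`, the learning rate `eta`, and the penalty strength `lam` are hypothetical names and settings, and squared error is an assumed loss.

```python
import math


def update_weights(weights, preds, target, eta=1.0, lam=0.5):
    """One multiplicative-weights step with a disagreement penalty.

    weights: current non-negative model weights summing to 1
    preds:   each model's scalar prediction for this step
    target:  observed outcome for this step
    eta:     learning rate (assumed hyperparameter)
    lam:     strength of the consensus penalty (assumed hyperparameter)
    """
    # Consensus is the current weighted mean prediction.
    consensus = sum(w * p for w, p in zip(weights, preds))
    new = []
    for w, p in zip(weights, preds):
        error = (p - target) ** 2      # individual performance
        drift = (p - consensus) ** 2   # disagreement with the ensemble
        new.append(w * math.exp(-eta * (error + lam * drift)))
    total = sum(new)
    return [w / total for w in new]


weights = [1 / 3, 1 / 3, 1 / 3]
# The third model drifts away from both the target and the other two models.
for target, preds in [(1.0, [1.1, 0.9, 2.0]), (1.2, [1.2, 1.1, 2.5])]:
    weights = update_weights(weights, preds, target)
```

Because the drift term is computed before the outcome's error is large, a model that starts diverging from the others loses weight sooner than it would under an error-only update.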
- Also Worth Noting · 12 notes
- 04Four Ways AI Safety and Ethics Communities Handle Their Fights
- 05New Benchmark Tests AI Agents' Step-by-Step Tool Decision Quality
- 06Smarter Training Trick Stops AI Models From Playing It Too Safe
- 07New Test Reveals If AI Actually Reads ECGs or Just Guesses
- 08Recursive AI loops tested for low-resource translation quality checks
- 09AI Safety Guard Catches Dangerous Household Robot Commands
- 10Physics-Based Framework Makes Low-Light AI Enhancement Far More Reliable
- 11AI Eyes That Scan Panoramas Like Real Humans Do
- 12Faster, Smarter Image-Text Matching via Optimal Transport
- 13Decomposing Training Gradients to Reveal What Models Actually Learned
- 14Robots That "Re-Look" Before Acting Solve Tasks Better
- 15AI Video Models Lack a True Sense of Physical Time
- Worth Reading
Static ensemble weights fail in non-stationary environments, and coherence between models carries the signal you're missing
When LLMs retrieve documents to answer questions, they excel at math puzzles but fail catastrophically on cryptographic proofs—even when the correct answer sits in their retrieved context. The problem: models trained on clean benchmarks don't learn to verify retrieved information against subtle real-world constraints, leaving production systems vulnerable to confident hallucinations on security-critical tasks.
- Worth Reading
LLMs That Ace Math Olympiads Collapse on Real Cryptographic Code Proofs
- Worth Reading
LLMs That Ace Python Collapse on a General-Purpose Language With Thin Training Data
- Also Worth Noting · 12 notes
- 04AI Search Agent That Learns From Its Own Past Mistakes
- 05Cloning Real Websites to Safely Train AI Web Agents
- 06AI System That Automatically Judges If Research Ideas Are New
- 07Graph Transformers Turn DNS Traffic Into Cyberattack Detectors
- 08Tiny Model Beats Bigger Ones at Understanding 3D Shapes
- 09New Tool Ranks AI Reasoning Models More Fairly
- 10New Tool Measures How Much Synthetic Data Leaks Privacy
- 11Mapping City Surface Materials in 3D Using Laser Scanners
- 12AI Images Look Too Vivid — Here's How to Fix That
- 13Robots That Keep Learning New Tasks Without Forgetting Old Ones
- 14AI Removes Haze From Photos Without Needing Labeled Training Data
- 15Fixing AI's Tendency to "Forget" Images in Long Documents
- Worth Reading
Text-to-image models fail at complex text because glyph templates were never in the loop
GlyphBanana lets AI image generators finally render complex text—formulas, CJK characters, mathematical symbols—by anchoring them with pre-made character templates instead of relying on training data that never existed. It works instantly on existing models without retraining, making it a direct solution for design tools and document generation systems that need reliable text in images.
- Worth Reading
DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
For the first time, a single AI system can understand complex sports videos across multiple sports and tasks simultaneously—recognizing plays, interpreting rules, and analyzing tactics all at once. This works because the system learns through trial-and-error reasoning rather than memorization, enabling it to handle the fast motion and rule complexity that stump previous narrow models. Sports analytics teams and video AI researchers now have a unified blueprint replacing fragmented tool chains.
- Worth Reading
Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework
AI agents that remember conversations over time are becoming common, but no one has yet figured out how to stop those memories from getting corrupted, manipulated, or drifting into false beliefs. This paper introduces the first framework to actively protect evolving agent memory—catching contradictions before they're stored and flagging memories that slowly change meaning—making long-term AI agents actually trustworthy.
- Also Worth Noting · 12 notes
- 04One Model Trained Smarter Serves Millions of Different Users
- 05AI That Thinks While Watching Live Video, Not After
- 06AI Agent That Reads Gene Activity to Explain Cell Biology
- 07New Benchmark Tests AI's Ability to Navigate Chinese Legal Documents
- 08Smarter Decoding Trick Makes AI Summaries Miss Less
- 09AI Models Now Watch Video and Reason at the Same Time
- 10AI Agent Builds Open-Vocabulary 3D Scenes From Text
- 11360° AI Vision Predicts Any Object in 3D Space
- 12One AI Model That Fixes Blur for Any Camera Lens
- 13Hidden Color Code Found Inside AI Image Generator's Brain
- 14Robotic Hand Places Soft Material Only Where It Counts
- 15Why AI Models Often Prefer Truth — It's About Compression
- Worth Reading
Knowledge Graph RAG Breaks on Multi-Hop Questions — Entity Summaries Fix the Retrieval Phase
Knowledge graphs struggle to answer complex questions because indexing strips away context needed to trace connections across multiple steps. Entity-level summaries that preserve this context—built during indexing rather than at query time—restore the ability to answer "who founded the company that acquired X?" without graph traversal. This breaks the indexing bottleneck that's been silently capping multi-hop reasoning in knowledge graph systems.
- Worth Reading
Using Code as Intermediate Representation Improves VLM Spatial Reasoning by 68.8%
AI image-understanding systems now accurately answer spatial questions like "where is the glass?" by first writing code to map object locations, boosting accuracy by 68.8%. This breakthrough helps developers build more reliable robots and automation tools that need to understand physical layouts.
- Worth Reading
Imitation Learning Can't Teach Judgment — Agents Trained on Perfect Demos Fail Out-of-Distribution
AI agents trained by copying human experts fail when conditions change slightly—they've never learned what *not* to do. New research shows agents need to experience and learn from failures in safe environments to develop real judgment, making them four times more resilient to unexpected situations.
- Also Worth Noting · 12 notes
- 04AI Search Agent That Learns From Its Own Past Mistakes
- 05Cloning Real Websites So AI Agents Can Practice Safely
- 06AI System That Judges Whether Research Ideas Are Truly New
- 07Graph Transformers Catch Malicious Domains Without Labeled Data
- 08Tiny Model Beats Giants by Training on Point Cloud Data Only
- 09New Tool Ranks AI Reasoning Models More Rigorously
- 10Measuring How Easily Synthetic Data Leaks Real People
- 11Laser Scanning Identifies Street-Level Surface Materials in 3D
- 12AI Image Colors Are Too Vivid — Here's the Fix
- 13Robots That Keep Learning New Tasks Without Forgetting Old Ones
- 14Lightweight LoRA Adapters Clear Hazy Photos Without Labeled Data
- 15Fixing AI's Tendency to "Forget" Images in Long Conversations
- Worth Reading
Diffusion Models Don't Fail at Text Because They Can't Reason — They Fail Because They've Never Seen the Input
Text-to-image AI models fail at rendering complex text and formulas not because they can't reason, but because they've never encountered these inputs during training. GlyphBanana solves this by injecting character templates directly into the model's processing, bypassing the gap entirely—a practical tool for teams automating documents, scientific figures, and multilingual designs without retraining.
- Worth Reading
Unsupervised RLVR Hits a Ceiling Set by the Initial Distribution, Not Compute
A new study reveals that training AI systems through self-improvement has a hard limit set by the initial training data, not raw computing power. Once models exhaust the knowledge embedded in their starting point, they begin collapsing into repetitive, useless outputs—meaning better pre-training data is more critical than throwing more compute at the problem.
- Worth Reading
Sparse Attention Degrades Long-Form Quality in Ways Standard Perplexity Benchmarks Don't Catch
Sparse attention speeds up AI models for massive documents but secretly breaks their ability to connect ideas across long distances—while appearing perfect on standard tests. This discovery exposes a critical blind spot: efficiency tricks that look safe actually cripple reasoning on real long-document tasks, affecting anyone building document search or analysis systems.
- Also Worth Noting · 12 notes
- 04One Shared Model Efficiently Serves Many Different Users
- 05AI That Thinks and Watches Video at the Same Time
- 06AI Agent That Reads Gene Activity to Generate Biology Hypotheses
- 07New Benchmark Tests AI Legal Assistants on Chinese Law
- 08Smarter Decoding Trick Makes AI Summaries Miss Less
- 09AI That Thinks While Watching Video, Not After
- 10AI Agent Builds 3D Scenes From Plain-Text Descriptions
- 11360° AI Vision That Recognizes Objects It's Never Seen
- 12One AI System to Fix Blur Across All Camera Lenses
- 13Hidden Color Code Found Inside AI Image Generator
- 14Robotic Hand Uses Soft Joints and Rigid Links for Better Grip
- 15Why Language Models Lean Toward Truth Without Being Taught To
- Worth Reading
CBCT Tells You Where the Tissue Was. Ultrasound Tells You Where It Is Now.
Surgeons navigate using CT scans that become outdated the moment a patient breathes or tissue shifts. This framework pairs CT with a robotic ultrasound probe that continuously tracks tissue movement in real time, automatically updating the surgical map without needing new scans. It transforms static imaging into live, deformable guidance for abdominal surgery.
- Worth Reading
High-Noise Diffusion Steps Contain Low-Res Information — Processing at Full Resolution Is Wasted Compute
Diffusion models waste compute by processing images at full resolution during high-noise steps, when early denoising only needs low-resolution information. This research cuts computational costs by 40% by dynamically lowering resolution in early stages and gradually increasing it as details emerge, enabling faster image generation on phones and cheaper server inference without sacrificing quality.
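A resolution schedule of this kind can be sketched as a function from denoising step to working resolution. This is a minimal illustration assuming a linear ramp; the function name `resolution_for_step` and the values for `full_res`, `min_res`, and `multiple` are hypothetical and are not taken from the paper.

```python
def resolution_for_step(step, total_steps, full_res=1024, min_res=256, multiple=64):
    """Pick a working resolution for one denoising step.

    step 0 is the noisiest step; step total_steps - 1 is the last one.
    Resolution grows linearly from min_res to full_res and is rounded
    down to a multiple of `multiple` (all values are assumed settings).
    """
    if total_steps < 2:
        return full_res
    progress = step / (total_steps - 1)   # 0.0 (noisy) .. 1.0 (clean)
    res = min_res + progress * (full_res - min_res)
    return int(res // multiple) * multiple


# Per-step resolutions for a 10-step sampler.
schedule = [resolution_for_step(s, 10) for s in range(10)]

# Rough pixel-count cost relative to running every step at full resolution.
relative_cost = sum(r * r for r in schedule) / (len(schedule) * 1024 * 1024)
```

The actual savings depend on the schedule shape and on how the model's cost scales with resolution, so `relative_cost` here is only a pixel-count proxy.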
- Worth Reading
Factual Associations in LLMs Are Stored as Low-Rank Subspaces in Mid-Layer MLP Weights
Scientists pinpointed exactly where language models store facts—in tiny, compressed sections of mid-layer weights—enabling surgical corrections to individual false beliefs without damaging related knowledge. This breakthrough lets AI developers fix errors and update outdated information without expensive retraining, moving toward safer, more maintainable AI systems.
- Also Worth Noting · 12 notes
- 04One Framework to Benchmark All Medical AI Agent Teams
- 05Training AI Agents Using Their Own Live Feedback
- 06One agentic system automates the entire LLM benchmarking pipeline
- 07Active Learning Cuts AI Training Data Needs Dramatically
- 08New Benchmark Tests AI's Ability to Write Threat Intel Reports
- 09AI Model Reads Patient Records Like a Medical Timeline
- 10A Simple Adam Fix That Handles Shifting Time-Series Data
- 11Neural Network Weights Are Data — Here's How to Use Them
- 12Memory-Augmented AI Tracks Oil Spills Across SAR Images
- 13Generating Realistic Bad-Weather Lane Data Without Re-Labeling
- 14Finding Your Location Using Only a Text Description
- 15Teaching AI to Find Usable Spots in Full 360° Rooms