Brief · AI research papers, explained for builders

Jun 21, 2026 · Sunday · 6 entries

Worth Reading
Strip the Leakage, and the LLM Forecasting Edge Mostly Disappears
A 36-month leakage-controlled test shows a 7B RAG forecaster's median IC of +0.154 is largely explained by macro-analog retrieval, not LLM capability.
Also Worth Noting · 5 notes
1. MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
2. Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning
3. BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
4. Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards
5. SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery

Jun 20, 2026 · Saturday · 6 entries

Jun 19, 2026 · Friday · 6 entries

Jun 18, 2026 · Thursday · 6 entries

Jun 17, 2026 · Wednesday · 6 entries

Jun 16, 2026 · Tuesday · 6 entries

Jun 15, 2026 · Monday · 6 entries

Jun 14, 2026 · Sunday · 6 entries

Jun 13, 2026 · Saturday · 6 entries

Jun 12, 2026 · Friday · 6 entries

Jun 11, 2026 · Thursday · 6 entries

Jun 10, 2026 · Wednesday · 6 entries

Jun 9, 2026 · Tuesday · 6 entries

Jun 8, 2026 · Monday · 6 entries

Jun 7, 2026 · Sunday · 6 entries

Jun 6, 2026 · Saturday · 6 entries

Jun 5, 2026 · Friday · 6 entries

Jun 4, 2026 · Thursday · 6 entries

Jun 3, 2026 · Wednesday · 6 entries

Jun 2, 2026 · Tuesday · 6 entries

Jun 1, 2026 · Monday · 6 entries

May 31, 2026 · Sunday · 6 entries

May 29, 2026 · Friday · 6 entries

May 28, 2026 · Thursday · 6 entries

May 27, 2026 · Wednesday · 6 entries

May 26, 2026 · Tuesday · 6 entries

May 25, 2026 · Monday · 6 entries

May 24, 2026 · Sunday · 6 entries

May 22, 2026 · Friday · 6 entries

May 21, 2026 · Thursday · 6 entries

May 20, 2026 · Wednesday · 6 entries

May 19, 2026 · Tuesday · 6 entries

May 18, 2026 · Monday · 6 entries

May 17, 2026 · Sunday · 6 entries

May 16, 2026 · Saturday · 6 entries

May 15, 2026 · Friday · 6 entries

May 14, 2026 · Thursday · 6 entries

May 12, 2026 · Tuesday · 6 entries

May 11, 2026 · Monday · 6 entries

May 10, 2026 · Sunday · 6 entries

May 9, 2026 · Saturday · 6 entries

May 8, 2026 · Friday · 6 entries

May 7, 2026 · Thursday · 6 entries

May 6, 2026 · Wednesday · 6 entries

May 5, 2026 · Tuesday · 6 entries

May 4, 2026 · Monday · 6 entries

May 3, 2026 · Sunday · 6 entries

May 2, 2026 · Saturday · 4 entries

Apr 30, 2026 · Thursday · 5 entries

Apr 27, 2026 · Monday · 6 entries

Apr 23, 2026 · Thursday · 6 entries

Apr 22, 2026 · Wednesday · 6 entries

Apr 21, 2026 · Tuesday · 6 entries

Apr 20, 2026 · Monday · 1 entry

Worth Reading
Hyatt Deployed ChatGPT Enterprise Globally: Rollout Details
Hyatt became the first major hospitality chain to deploy AI across its entire global workforce, using specialized tools for different jobs—general AI for office work and code-generation AI for technical tasks. This move signals that enterprise AI adoption is maturing beyond experiments, and shows other large companies that matching the right AI tool to specific work categories, rather than forcing one model everywhere, delivers better results.

Apr 19, 2026 · Sunday · 6 entries

Apr 18, 2026 · Saturday · 6 entries

Apr 17, 2026 · Friday · 6 entries

Apr 16, 2026 · Thursday · 6 entries

Apr 14, 2026 · Tuesday · 6 entries

Apr 13, 2026 · Monday · 1 entry

Worth Reading
The Cloud Infrastructure Stack for AI Agents Is Consolidating Fast
Cloudflare now runs advanced AI models directly on its global network, letting enterprises build and deploy AI agents without juggling multiple vendors for security, routing, and secrets management. This matters for regulated industries where data must stay within strict boundaries—the real win is infrastructure consolidation, not new AI capabilities.

Apr 11, 2026 · Saturday · 6 entries

Apr 10, 2026 · Friday · 6 entries

Apr 9, 2026 · Thursday · 6 entries

Apr 8, 2026 · Wednesday · 6 entries

Apr 7, 2026 · Tuesday · 6 entries

Apr 6, 2026 · Monday · 3 entries

Apr 5, 2026 · Sunday · 6 entries

Apr 4, 2026 · Saturday · 6 entries

Apr 3, 2026 · Friday · 6 entries

Apr 2, 2026 · Thursday · 6 entries

Apr 1, 2026 · Wednesday · 6 entries

Mar 31, 2026 · Tuesday · 6 entries

Mar 30, 2026 · Monday · 1 entry

Worth Reading
Bayesian Optimization Addresses Concrete Mix Design's Data Problem
For decades, concrete makers relied on expensive trial-and-error lab testing to find the right ingredient mix. Meta's new AI model sharply reduces that testing by learning from each result to predict which experiment will be most informative next, cutting both the number of physical tests needed and the carbon emissions from cement-heavy formulas.

Mar 29, 2026 · Sunday · 6 entries

Mar 28, 2026 · Saturday · 6 entries

Mar 26, 2026 · Thursday · 6 entries

Mar 23, 2026 · Monday · 1 entry

Worth Reading
OpenAI's Safety Stack for Sora 2 Reveals How Hard Real-Time Video Moderation Actually Is
Real-time video generation breaks old safety tools designed for images—watermarks degrade under compression, and new user behaviors outpace single-layer defenses. OpenAI's Sora now combines prompt filtering, output classification, and platform enforcement across multiple layers to catch harmful content at scale, but developers building on video APIs can't rely on upstream safety alone.

Mar 22, 2026 · Sunday · 3 entries

Mar 19, 2026 · Thursday · 9 entries

Mar 18, 2026 · Wednesday · 5 entries

Mar 17, 2026 · Tuesday · 13 entries

Mar 15, 2026 · Sunday · 15 entries

Mar 14, 2026 · Saturday · 15 entries

Mar 13, 2026 · Friday · 15 entries

Mar 12, 2026 · Thursday · 15 entries

Mar 10, 2026 · Tuesday · 15 entries

Strip the Leakage, and the LLM Forecasting Edge Mostly Disappears

Robots That Play First Solve Tasks Better: 20-Point Gains Without Extra Instructions

ContextRL Trains Models to Find the One Sentence That Actually Matters

SAE Feature Clamping Gets a 95.8% Bypass Rate

The Field's Go-To GUI Agent Dataset Actively Breaks Fine-Tuning

Same Success Rate, Completely Different Failure Modes: Web Agent Eval Is Broken

Expert Exam Scores Don't Predict Medical LLM Reliability Under Pressure

Frozen Safety Monitors Break After Fine-Tuning, Not After Quantization

EvoTrainer: Fixing the Training Harness While Tuning the Policy Is a False Economy

Two Tokens Fix Hidden-State Recurrence: SWITCH Makes Latent Reasoning RL-Trainable

MiniMax Sparse Attention Cuts Million-Token Compute by 28x Without Quality Loss

The Safety Tool That Became a Jailbreak: GCD's Hidden Attack Surface

CoT Fine-Tuning Quietly Destroys Long-Context Recall in Hybrid LLMs

PPO's Ratio Clipping Has a Blind Spot. DRPO Fixes It.

On-Policy Distillation Breaks at the Prefix, Not the Token

MoE-to-Dense Conversion Beats Dense Pruning by 6.3 Points

The Part of Your LLM You Throw Away Is Quietly Corrupting Your Embeddings

Code2LoRA Matches Per-Repo Fine-Tuning at Zero Inference Token Cost

AI's Deployment Gap Is an Evaluation Problem, Not a Capability Problem

NTP's One-Hot Supervision Leaves Representation Space Broken by Design

Agentic Inference Is Structurally Wasteful: LayerRoute Fixes It in 6 Minutes

On-Policy Distillation Without Logit Access: +28.64% on Math

LoRA Has a Memory Ceiling, and Now You Can Calculate It

Safety Benchmarks May Be Measuring Evaluation Awareness, Not Alignment

Safety Benchmarks May Measure Test-Awareness, Not Alignment

Agentic RL Training Actively Degrades Tool Judgment: A Fix in 18% Fewer Calls

AI Research Agents Fabricate Citations at 21%: A Verifiability Crisis

Silencing Agents Beats Letting Them Talk: DarkForest Cuts Errors 30.7%

Most Tokens in a Correct Response Are Getting the Wrong Credit Signal

RLVR Fine-Tuning Is Geometrically Wasteful: Rank-1 Extrapolation Matches Full Training

Video MLLMs Fake Audio Understanding: Visual Hallucination at Scale

Post-Trained MoE Models Can Skip Half Their Experts Without Retraining

Post-Trained MoE Can Skip Half Its Experts Without Retraining

Frontier Research Agents Pass at Under 22% on Consulting-Grade Work

Dense Teacher Supervision Breaks Multi-Turn Agents. SDAR Fixes It.

Agents Score 55% on Belief Invalidation: The Silent Memory-Rot Problem

One Base Model, One Million Policies: MinT's LoRA Adapter Architecture

On-Policy Distillation Hurts When the Teacher's Context Is Wrong

MatryoshkaLoRA: One Training Run, Every Rank You Need

LEAD Cuts Chain-of-Thought Length Without Accuracy Loss

SkillOS: The Curation Bottleneck That Keeps LLM Agents Stuck at Zero

Every LLM Throws Away Token Identity After Layer One. TIDE Doesn't.

MoE Experts Are Entangled by Default. EMO Fixes That at 1B Scale.

SFT Before RL Is Actively Hurting Your Multimodal Model

GUI Agents Top Out at 21% on Multi-App Tasks That Mirror Real Work

The Most Engineered Agentic Gateway Scores 0.000 on Every Safety Check

A Single Tool Call Can Poison an Agent's Memory for 100+ Sessions

Cross-Architecture dLLM Distillation: 0.6B Student, 48.78 HumanEval

Your LangGraph Orchestrator Is Failing 24% of Travel Conversations

Safety Signal Lives Inside the Model, Not Just at the End

On-Policy Distillation Makes Models More Accurate and More Overconfident — Simultaneously

Learning Adaptive Reasoning Paths for Efficient Visual Reasoning

Agents Can Now Learn From Their Own Past Reasoning, Without Retraining

Hyatt Deployed ChatGPT Enterprise Globally: Rollout Details

Deep Research Agents Fail on the Basics — and Current Benchmarks Can't See It

Reward models that explain themselves outperform those that just score

Computer-Use Agents Fail Safety Tests Even When Users Do Everything Right

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

The 3D point cloud field has a reproducibility problem, and it's structural

The Cloud Infrastructure Stack for AI Agents Is Consolidating Fast

The 3D registration benchmark problem nobody fixed: models trained on perfect data, tested on perfect data

Training 100B+ Models Without a Cluster: Memory Architecture Beats Hardware Scale

Agents Keep Relearning the Same Lessons: SkillX Builds a Shared Curriculum Instead

When the teacher cheats, the student memorizes instead of learns

AI-generated videos are too consistent — and that's exactly how to catch them

How Meta Used AI to Map Tribal Knowledge in Large-Scale Data Pipelines

Open-ended discovery systems are not truly open-ended; CORAL is the first framework to make them autonomous

LoRA Isn't the Default for Hybrid Models Anymore

Treating vision and audio as second-class citizens has a cost

AI agents read physics papers but do not reproduce them.

Training in the Deployment Harness Closes the Benchmark-Production Gap

Diffusion policy RL has a hidden unification problem — and it's slowing everyone down

Bayesian Optimization Addresses Concrete Mix Design's Data Problem

Cost volumes are stereo matching's sacred cow — warping alone just dethroned them

Video Agents That Decide What to Watch Before Watching It

Long video QA breaks when models ignore what the video is already telling them

Deep research agents do not need the internet; they need the right offline corpus

DoRA's memory wall breaks at high rank: a systems fix, not a math fix

OpenAI's Safety Stack for Sora 2 Reveals How Hard Real-Time Video Moderation Actually Is

3D reasoning in VLMs stems from perception issues, not language processing.