Also Worth Noting - 2026-06-21

Two eval frameworks exposing hidden gaps, two GRPO training fixes, and one materials-science RL environment.

02 [Eval] MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop
Binary pass/fail scores for desktop agents hide most of what matters. MacAgentBench covers 676 tasks across 25 applications on macOS, with nearly 60% requiring multiple apps, and scores partial task completion rather than treating a near-miss as a full failure. Critically, it evaluates agents inside their actual automation frameworks, meaning prior benchmark numbers collected without framework augmentation are not comparable to real deployment conditions. Teams shipping always-on Mac automation should treat existing binary-pass scores as lower bounds, not baselines.

03 [Training] Learning at the Right Pace: Adaptive Data Scheduling Improves LLM Reinforcement Learning
Uniform sampling during RL post-training wastes compute on problems the policy has already solved. Adaptive Data Scheduling (ADS) replaces flat sampling with a dual-level scheme: at the cluster level it organizes training data by semantic similarity, and at the sample level it selects examples near the policy's current capability boundary. The result is measurably better math and coding reasoning scores under GRPO without changing the underlying model architecture. Teams running RL post-training pipelines should treat data scheduling as a first-class hyperparameter, not an afterthought.
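A minimal sketch of what two-level, boundary-focused sampling could look like. It is not the ADS algorithm: the weighting rule `p * (1 - p)` (which peaks at a 50% pass rate and is zero for problems that are always solved or always failed), the function names, and the toy clusters are all assumptions made for illustration.

```python
import random

def boundary_weight(pass_rate):
    # Assumed proxy for "near the capability boundary":
    # largest at pass_rate = 0.5, zero at 0.0 and 1.0.
    return pass_rate * (1.0 - pass_rate)

def sample_batch(clusters, batch_size, rng):
    """clusters maps a cluster id to a list of (problem_id, pass_rate) pairs.

    Raises ValueError (from random.choices) if every problem has weight zero.
    """
    cluster_ids = list(clusters)
    # Cluster level: weight each cluster by its mean boundary weight.
    cluster_weights = [
        sum(boundary_weight(p) for _, p in clusters[c]) / len(clusters[c])
        for c in cluster_ids
    ]
    batch = []
    for _ in range(batch_size):
        chosen = rng.choices(cluster_ids, weights=cluster_weights, k=1)[0]
        items = clusters[chosen]
        # Sample level: prefer problems the policy solves only some of the time.
        item_weights = [boundary_weight(p) for _, p in items]
        problem_id, _ = rng.choices(items, weights=item_weights, k=1)[0]
        batch.append(problem_id)
    return batch

# Hypothetical pass rates estimated from recent rollouts.
clusters = {
    "algebra": [("a1", 1.0), ("a2", 0.5)],   # a1 is already solved
    "geometry": [("g1", 0.0), ("g2", 0.4)],  # g1 is currently out of reach
}
batch = sample_batch(clusters, batch_size=50, rng=random.Random(0))
```

In this toy setup the solved problem `a1` and the unreachable problem `g1` carry zero weight, so the batch is drawn only from `a2` and `g2`.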

04 [Eval] BabelJudge: Measuring LLM-as-a-Judge Reliability Across Languages and Agent Trajectories
Position bias and verbosity bias in LLM judges are significantly worse in lower-resource languages, meaning a single judge model used across a multilingual pipeline is systematically miscalibrated in ways raw accuracy numbers do not surface. BabelJudge audits four failure modes simultaneously: position bias, verbosity bias, order inconsistency, and cross-lingual degradation, all without requiring human annotations. The framework is open-source and model-agnostic. Any team running multilingual evaluation with a single judge should run a BabelJudge audit before trusting aggregate scores.
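To illustrate why such an audit needs no human labels, here is a generic swap test: ask the judge about the same pair twice with the order reversed and count how often the preferred answer changes. This is a common consistency check, not BabelJudge's API or metric definition; the `judge(prompt, first, second)` callable and the toy judges are hypothetical.

```python
def swap_flip_rate(judge, pairs):
    """Fraction of pairs whose winner changes when the answer order is swapped.

    judge(prompt, first, second) is assumed to return "first" or "second".
    pairs is a list of (prompt, answer_a, answer_b) tuples.
    """
    flips = 0
    for prompt, answer_a, answer_b in pairs:
        a_wins_original = judge(prompt, answer_a, answer_b) == "first"
        a_wins_swapped = judge(prompt, answer_b, answer_a) == "second"
        if a_wins_original != a_wins_swapped:
            flips += 1
    return flips / len(pairs)

# Two toy judges standing in for a real model call.
def always_first(prompt, first, second):
    return "first"  # pure position bias

def prefers_longer(prompt, first, second):
    return "first" if len(first) >= len(second) else "second"

pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "a fairly detailed reply", "ok"),
]
position_biased_rate = swap_flip_rate(always_first, pairs)
length_biased_rate = swap_flip_rate(prefers_longer, pairs)
```

The purely position-biased judge flips on every pair, while the length-biased judge never flips. That second result shows why verbosity bias needs its own probe: a swap test alone would rate that judge as perfectly consistent. Running the same function separately on each language's pairs gives a per-language comparison.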

05 [Training] Beyond Penalizing Mistakes: Stabilizing Efficiency Training in Large Reasoning Models via Adaptive Correct-Only Rewards
Adding length penalties to GRPO does not just fail to reduce verbosity; it actively destroys reasoning capability through a group-normalization artifact. When incorrect answers receive continuous length penalties, GRPO's group normalization flips the advantage signal for correct-but-verbose answers, turning them negative and collapsing the reward signal. Filtering length rewards to apply only to correct answers eliminates the collapse mechanism entirely while preserving brevity gains. Teams using GRPO for reasoning efficiency training should audit whether their reward configuration exposes correct answers to this normalization flip.
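The sign flip can be reproduced on a toy group of four rollouts. The sketch below assumes a specific reward shape that the summary does not specify: correctness (0 or 1) plus a continuous brevity term `alpha * (1 - length / max_len)`. The group, `alpha`, and all function names are illustrative assumptions, and advantages use the usual group normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    # GRPO-style normalization within one group of rollouts.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def rewards_for(group, alpha, max_len, correct_only):
    """group is a list of (is_correct, length) pairs."""
    rewards = []
    for is_correct, length in group:
        brevity = alpha * (1.0 - length / max_len)  # assumed length term
        if correct_only and not is_correct:
            brevity = 0.0  # incorrect answers get no length signal
        rewards.append(float(is_correct) + brevity)
    return rewards

# One correct-but-verbose answer, one correct short answer,
# and two incorrect short answers.
group = [(True, 1000), (True, 100), (False, 100), (False, 100)]

adv_all = group_advantages(
    rewards_for(group, alpha=0.8, max_len=1000, correct_only=False)
)
adv_correct_only = group_advantages(
    rewards_for(group, alpha=0.8, max_len=1000, correct_only=True)
)
```

Under these assumptions, the length term earned by the two incorrect short answers lifts the group mean above the verbose correct answer's reward, so `adv_all[0]` is negative. With the length term restricted to correct answers, the mean drops and `adv_correct_only[0]` is positive again.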

06 [Application] SVGym (SciVerseGym): An Environment for Reinforcement Learning and Bayesian Optimization in Crystal Discovery
Crystal structure search has no shared scoreboard: every method runs on its own bespoke pipeline, making direct comparison impossible. SciVerseGym wraps the full crystal discovery loop (structure editing, relaxation, scoring, and constraint checking) as a Gymnasium-compatible Markov decision process, so RL agents and Bayesian optimization methods can compete on identical footing for the first time. Agents observe atomistic structures, apply chemically meaningful edits, and receive feedback from a configurable evaluator. ML teams working on materials discovery now have a standard environment to benchmark against rather than rebuilding evaluation infrastructure from scratch.
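For readers unfamiliar with the interface, the toy class below mirrors the Gymnasium `reset`/`step` calling convention in plain Python. It is not SVGym's API and imports nothing from Gymnasium. The class name, the element-substitution action, the constraint, and the scoring function are hypothetical stand-ins, and the relaxation step is omitted entirely.

```python
class ToyCrystalEnv:
    """Toy edit-score-check loop with Gymnasium-style method signatures."""

    def __init__(self, start, elements, evaluator, max_steps=10):
        self.start = tuple(start)        # element symbol on each lattice site
        self.elements = tuple(elements)  # symbols an edit may place
        self.evaluator = evaluator       # configurable scoring callable
        self.max_steps = max_steps

    def reset(self, seed=None, options=None):
        self.sites = list(self.start)
        self.steps = 0
        return tuple(self.sites), {}

    def step(self, action):
        site, element_index = action     # edit: substitute one site
        candidate = list(self.sites)
        candidate[site] = self.elements[element_index]
        self.steps += 1
        valid = self._satisfies_constraints(candidate)
        if valid:
            self.sites = candidate
            reward = self.evaluator(tuple(self.sites))
        else:
            reward = -1.0                # rejected edit, structure unchanged
        terminated = False
        truncated = self.steps >= self.max_steps
        return tuple(self.sites), reward, terminated, truncated, {"valid": valid}

    def _satisfies_constraints(self, sites):
        # Hypothetical constraint: keep at least two distinct elements.
        return len(set(sites)) >= 2

def toy_score(sites):
    # Stand-in evaluator: fraction of sites occupied by oxygen.
    return sites.count("O") / len(sites)

env = ToyCrystalEnv(start=("Ti", "Ti", "O"), elements=("Ti", "O"), evaluator=toy_score)
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step((0, 1))  # put O on site 0
```

Because the evaluator is just a callable passed to the constructor, an RL policy and a Bayesian optimizer could both drive the same `step` loop and be scored by the same function, which is the comparison the paragraph describes.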