Also Worth Noting - 2026-06-13

Scaling ceilings cracked open: five papers on environment composition, repo-level agents, 2-step diffusion, combinatorics evals, and inference compression


02 [Training] Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
Hand-building more tasks is not the right axis for scaling RL reasoning. RACES treats verifiable environments as composable units, recursively combining small tasks so that the amount of training signal grows exponentially without construction cost growing in step with it. The key finding is that composed environments produce reasoning transfer that individually constructed environments cannot, which suggests the composition structure itself carries information about generalization. Teams building RL pipelines for reasoning models should consider environment composition before expanding task libraries.
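To make the composition idea concrete, here is a minimal sketch, not the RACES implementation. It assumes each environment is a deterministic integer task with a reference solver, and that "composition" means chaining one task's output into the next. `Env`, `compose`, `compose_all`, and the three toy tasks are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Env:
    """A verifiable task: a name plus a reference solver the verifier checks against."""
    name: str
    solve: Callable[[int], int]

    def verify(self, x: int, answer: int) -> bool:
        # Binary reward: the answer either matches the reference or it does not.
        return self.solve(x) == answer


def compose(inner: Env, outer: Env) -> Env:
    """Chain two environments: the output of `inner` becomes the input of `outer`."""
    return Env(
        name=f"{outer.name}({inner.name})",
        solve=lambda x: outer.solve(inner.solve(x)),
    )


def compose_all(base: List[Env], depth: int) -> List[Env]:
    """Recursively build every chain of `depth` base environments."""
    if depth <= 1:
        return list(base)
    shorter = compose_all(base, depth - 1)
    return [compose(inner, outer) for inner in shorter for outer in base]


# Three hand-built toy tasks (assumed for illustration only).
BASE = [
    Env("double", lambda x: 2 * x),
    Env("add3", lambda x: x + 3),
    Env("square", lambda x: x * x),
]

if __name__ == "__main__":
    envs = compose_all(BASE, depth=3)
    print(len(envs))  # 3 base tasks chained to depth 3 give 3**3 = 27 environments
    print(envs[0].name, envs[0].verify(1, 8))
```

With n base tasks and chains of depth d, this toy scheme yields n^d verifiable environments from n hand-written ones, which is the sense in which composition scales faster than manual construction.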

03 [Agent] DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch
Patch-level benchmarks like SWE-bench understate how far code agents are from real software engineering. DeNovoSWE shifts the target to whole-repository generation from high-level specifications, assembling 4,818 verifiable instances, each of which requires architecting and implementing a complete codebase. The dataset's primary contribution is exposing the data gap at this task level rather than closing it. Teams training long-horizon code agents will find it a useful diagnostic of where current pipelines break down.

04 [Inference] High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation
Cutting diffusion steps from 4 to 2 is not a linear quality trade-off. Z-Image Turbo++ identifies model capacity, not the distillation loss, as the binding constraint at 2 steps. It addresses this with Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as the discriminator target. The result is competitive quality at exactly 2 steps, distilled from an 8-step teacher. For teams running real-time image generation at scale, this reframes the optimization target: make the distilled model bigger before tuning the loss.
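The "teacher images as discriminator target" idea can be illustrated with a standard GAN-style loss in which the positive class is the teacher's output instead of real photographs. This is a toy sketch that swaps in a plain binary cross-entropy objective; the paper's actual Distribution-Aligned Adversarial Learning loss is not specified here, and the function names and scalar logits are assumptions for illustration.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def discriminator_loss(teacher_logits: Sequence[float],
                       student_logits: Sequence[float]) -> float:
    """Binary cross-entropy with teacher samples labeled 1 and student samples labeled 0.

    Assumption: the logits are discriminator scores for images produced by the
    multi-step teacher and by the 2-step student. No external real images appear.
    """
    pos = sum(softplus(-z) for z in teacher_logits) / len(teacher_logits)
    neg = sum(softplus(z) for z in student_logits) / len(student_logits)
    return pos + neg


def student_adversarial_loss(student_logits: Sequence[float]) -> float:
    """Non-saturating generator loss: the student is rewarded for looking like the teacher."""
    return sum(softplus(-z) for z in student_logits) / len(student_logits)


if __name__ == "__main__":
    # A discriminator that separates the two sources well has a low loss.
    print(discriminator_loss([4.0, 3.5], [-4.0, -3.0]))
    # A student whose samples score as "teacher-like" has a low adversarial loss.
    print(student_adversarial_loss([3.0, 2.5]))
```

The only change from a conventional adversarial setup is where the positive samples come from, which keeps the student's target distribution aligned with the teacher it is distilled from.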

05 [Eval] ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Top frontier models score unevenly on Olympiad combinatorics for a specific structural reason: the tasks require constructive existence proofs, not pattern retrieval. ComBench isolates this with 100 human-annotated competition-level problems organized around two complementary settings, exposing a failure mode that arithmetic and algebra benchmarks do not surface. This is a different ceiling from compute or context length. Teams evaluating mathematical reasoning should add ComBench to distinguish creative construction ability from symbolic manipulation.

06 [Inference] Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models
Injecting full skill text into every LLM context is a silent inference tax that compounds at production scale. Existing compression methods target factual documents, not procedural knowledge, so they degrade task performance when applied to reusable skills. This approach uses resolution-adaptive compression that matches compression depth to how frequently a skill is invoked, reducing prefill cost and latency while retaining measurable task quality. For teams running LLM workflows with repeated skill calls, this is a directly applicable optimization to try before any hardware upgrade.
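A rough sketch of frequency-matched resolution selection follows. It is not the paper's method: it assumes each skill is stored at three pre-written resolutions and that more frequently invoked skills get the shorter form, since their prefill cost recurs most often. The summary above does not state the direction of that mapping, so treat it, the thresholds, and every name below as hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    """A reusable procedure stored at three resolutions (assumed layout)."""
    name: str
    full: str      # complete step-by-step text
    summary: str   # condensed steps
    stub: str      # one-line reminder
    calls: int = 0  # how many times the skill has been injected so far


def render(skill: Skill, hot: int = 10, warm: int = 3) -> str:
    """Pick a resolution from the invocation count (thresholds are illustrative)."""
    if skill.calls >= hot:
        return skill.stub
    if skill.calls >= warm:
        return skill.summary
    return skill.full


def build_context(skills: List[Skill]) -> str:
    """Assemble the skill section of a prompt and record one invocation per skill."""
    parts = []
    for skill in skills:
        parts.append(render(skill))
        skill.calls += 1
    return "\n".join(parts)


if __name__ == "__main__":
    deploy = Skill(
        name="deploy",
        full="1. Run tests. 2. Build image. 3. Push image. 4. Roll out. 5. Verify health.",
        summary="Test, build, push, roll out, verify.",
        stub="Use the standard deploy procedure.",
    )
    lengths = [len(build_context([deploy])) for _ in range(12)]
    print(lengths)  # context length drops as the skill is invoked more often
```

A production version would also need a quality check that falls back to a higher resolution when the compressed form hurts task performance, which this sketch omits.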