Also Worth Noting - 2026-05-05

Five papers on training data strategy, eval reliability, linear-time inference, video generation efficiency, and a medical AI benchmark.

02 [Training] Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling
For corpora at the scale available for German, repeating a small, high-quality filtered subset across multiple epochs beats a single pass over a large, lightly filtered set. The finding inverts the common instinct to maximize coverage: aggressive filtering across 500M web documents produces a higher-signal core that, when repeated, outperforms diversity-first training. The tradeoff holds even when the filtered set sacrifices substantial vocabulary coverage. Multilingual pretraining teams working outside English should treat this as a direct signal to revisit their filtering thresholds before scaling compute.
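The compute-matched comparison behind this result can be sketched as simple arithmetic: under a fixed token budget, a stricter filter leaves fewer tokens, so the same budget buys more epochs over them. The sketch below is illustrative only. The `Doc` type, the `quality` score, the thresholds, and the token counts are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical quality score in [0, 1], not the paper's filter
    tokens: int


def select_core(docs, threshold):
    """Keep only documents whose quality score meets the threshold."""
    return [d for d in docs if d.quality >= threshold]


def epochs_for_budget(docs, token_budget):
    """Number of passes over `docs` that fit in a fixed token budget."""
    total = sum(d.tokens for d in docs)
    if total == 0:
        raise ValueError("filtered set is empty")
    return token_budget / total


# Toy corpus: four documents of equal length with made-up quality scores.
docs = [
    Doc("a", 0.95, 1000),
    Doc("b", 0.40, 1000),
    Doc("c", 0.90, 1000),
    Doc("d", 0.20, 1000),
]
budget = 4000  # same training budget for both strategies

core = select_core(docs, threshold=0.8)   # aggressive filter
light = select_core(docs, threshold=0.1)  # light filter

print(epochs_for_budget(core, budget))   # 2.0 epochs over the filtered core
print(epochs_for_budget(light, budget))  # 1.0 epoch over the lightly filtered set
```

The sketch only shows how the two regimes are matched on tokens seen; whether repetition actually wins is the paper's empirical claim and depends on the quality of the filter.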

03 [Eval] Counting as a minimal probe of language model reliability
Models that ace math and coding benchmarks still fail at stable counting, a task with no knowledge dependencies, no semantics, and no tokenization confounds. The Stable Counting Capacity assay works by asking models to count repeated symbols until they break, stripping away every scaffold that benchmarks typically leave in place. Failure patterns reveal that benchmark success reflects memorized procedures rather than general rule execution. Anyone using eval scores to gate deployment should treat counting failure as a warning sign about the brittleness of that signal.
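A count-until-failure probe of this kind can be sketched in a few lines. This is a minimal illustration, not the paper's protocol: the prompt wording, the `stable_capacity` scoring rule, and the `toy_model` stand-in are all assumptions made for the example. In practice `model` would wrap a real LLM call.

```python
def make_prompt(symbol, n):
    """Build a counting prompt with `n` space-separated copies of `symbol`."""
    header = f"Count the occurrences of '{symbol}' below. Reply with a single integer."
    return header + "\n" + " ".join([symbol] * n)


def stable_capacity(model, symbol="a", max_n=64, trials=3):
    """Largest n such that the model answers correctly on every trial
    for every length from 1 up to n (hypothetical scoring rule)."""
    capacity = 0
    for n in range(1, max_n + 1):
        prompt = make_prompt(symbol, n)
        for _ in range(trials):
            if model(prompt).strip() != str(n):
                return capacity  # first failure ends the probe
        capacity = n
    return capacity


def toy_model(prompt):
    """Stand-in for an LLM call: counts correctly but saturates at 10."""
    n = prompt.splitlines()[-1].split().count("a")
    return str(min(n, 10))


print(stable_capacity(toy_model))  # 10
```

Repeating each length over several trials matters for sampled decoding, where a model can get a length right once and wrong the next time.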

04 [Inference] Linear-Time Global Visual Modeling without Explicit Attention
Attention's global modeling power does not require explicit token-wise aggregation. Mathematically, attention is equivalent to an MLP with dynamically predicted parameters, where those parameters act as a compressed representation of the full context. That reframing enables linear-time global sequence modeling without sparse approximations or kernel tricks. For teams running long-context inference at scale, this changes the cost calculus: global modeling no longer has to mean quadratic compute.
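The equivalence can be checked numerically for standard single-query softmax attention: the scaled keys play the role of a first-layer weight matrix, the transposed values play the role of a second-layer weight matrix, and softmax is the hidden activation. The sketch below verifies only that identity. It does not reproduce how the paper predicts the parameters in linear time, and the function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, keys, values):
    """Single-query softmax attention, written as token-wise aggregation."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]


def weights_from_context(keys, values):
    """Derive MLP weights from the context: W1 = K / sqrt(d), W2 = V^T."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    w1 = [[scale * x for x in k] for k in keys]  # shape n x d
    w2 = [list(col) for col in zip(*values)]     # shape d_v x n
    return w1, w2


def dynamic_mlp(q, w1, w2):
    """Two-layer MLP: hidden = softmax(W1 q), output = W2 hidden."""
    hidden = softmax([sum(a * b for a, b in zip(row, q)) for row in w1])
    return [sum(a * h for a, h in zip(row, hidden)) for row in w2]


q = [1.0, 0.5]
keys = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

w1, w2 = weights_from_context(keys, values)
out_attn = attention(q, keys, values)
out_mlp = dynamic_mlp(q, w1, w2)
print(all(abs(a - b) < 1e-9 for a, b in zip(out_attn, out_mlp)))  # True
```

In this direct construction the hidden width equals the number of context tokens, so the cost is unchanged; the saving described above would have to come from predicting a fixed-size parameter set instead.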

05 [Inference] Motion-Aware Caching for Efficient Autoregressive Video Generation
Chunk-level cache skipping in autoregressive video generation applies the same skip rate to high-motion and static pixels alike, which over-caches high-motion regions, where errors accumulate, and wastes compute on static regions, where skipping is safe. Motion-aware caching formalizes this asymmetry: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. Operating at pixel granularity rather than chunk granularity fixes both failure modes simultaneously. Teams running autoregressive video pipelines should see direct throughput gains without architectural changes.
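The per-pixel skip decision can be illustrated with a toy example: estimate motion as the absolute change between consecutive frames, recompute only pixels above a threshold, and reuse cached outputs elsewhere. Everything here is an assumption made for illustration, including the frame-difference motion estimate, the fixed threshold, and the `denoise_pixel` stand-in. A real denoiser operates on whole latents with spatial context, and the paper's actual criterion may differ.

```python
def motion_mask(prev_frame, curr_frame, threshold):
    """True where the per-pixel change exceeds the threshold (recompute)."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def cached_step(curr_frame, cache, mask, denoise_pixel):
    """Recompute masked pixels; reuse the cached output for the rest.

    Returns the new output and the number of pixels recomputed.
    """
    out, recomputed = [], 0
    for y, row in enumerate(curr_frame):
        out_row = []
        for x, value in enumerate(row):
            if mask[y][x]:
                out_row.append(denoise_pixel(value))  # high motion: do the work
                recomputed += 1
            else:
                out_row.append(cache[y][x])           # static: skip via cache
        out.append(out_row)
    return out, recomputed


prev = [[0.0, 0.0], [0.5, 0.5]]
curr = [[0.0, 0.9], [0.5, 0.52]]
cache = [[1.0, 2.0], [3.0, 4.0]]  # outputs stored from the previous step

mask = motion_mask(prev, curr, threshold=0.1)
out, recomputed = cached_step(curr, cache, mask, denoise_pixel=lambda v: v * 2)
print(mask)        # [[False, True], [False, False]]
print(recomputed)  # 1
```

A chunk-level scheme would make one decision for all four pixels; the mask lets the single moving pixel be recomputed while the other three are served from cache.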

06 [Application] Assessing Pancreatic Ductal Adenocarcinoma Vascular Invasion: the PDACVI Benchmark
Surgical eligibility for pancreatic cancer currently depends on radiologist assessment of vascular invasion, a judgment with substantial inter-rater variability even among experts. PDACVI is the first public benchmark targeting computational staging of this specific decision, addressing both the absence of public datasets and the diagnostic ambiguity at the tumor-vessel interface. The benchmark gives medical AI teams a concrete evaluation target where none previously existed. Groups building surgical planning tools now have a standardized reference point to measure against.