Also Worth Noting - 2026-05-15

Five papers on closing gaps in robot manipulation, agentic safety, long-context training, LLM routing, and open-ended code generation

02 [Agent] IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

Frame-conditioned VLA policies randomly resample action chunks when two observations look identical but diverge in intent, producing inter-chunk conflicts that destabilize execution. IntentVLA adds a history-conditioned intent token that encodes recent visual context, giving the policy a short-horizon anchor that resolves this aliasing before each replanning step. The fix targets a failure mode that appears under partial observability, which covers most real manipulation deployments. Teams running imitation-learned manipulation policies should check whether chunk-resampling instability, rather than model capacity, is the actual bottleneck.
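To make the aliasing problem concrete, here is a minimal sketch of conditioning a policy input on recent history. It is not the paper's architecture: `IntentBuffer`, `policy_input`, the mean-pooling step, and the toy feature vectors are all hypothetical stand-ins for a learned history encoder.

```python
from collections import deque


class IntentBuffer:
    """Keeps the last `horizon` frame features and pools them into one token."""

    def __init__(self, horizon=4):
        self.frames = deque(maxlen=horizon)  # oldest frames fall off automatically

    def push(self, features):
        self.frames.append(list(features))

    def token(self):
        # Assumption: mean pooling stands in for a learned history encoder.
        n = len(self.frames)
        return [sum(column) / n for column in zip(*self.frames)]


def policy_input(buffer, current_features):
    """Concatenate the current frame with the intent token before replanning."""
    buffer.push(current_features)
    return list(current_features) + buffer.token()


# Two toy episodes that end on the same (aliased) frame via different histories.
reach = IntentBuffer()
retract = IntentBuffer()
for frame in ([0.0, 0.0], [0.5, 0.0]):
    reach.push(frame)
for frame in ([1.0, 1.0], [0.5, 1.0]):
    retract.push(frame)

aliased = [0.5, 0.5]
input_reach = policy_input(reach, aliased)
input_retract = policy_input(retract, aliased)
```

The frame half of both inputs is identical, while the pooled history half differs, so a policy that reads the full vector can tell the two situations apart.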

03 [Application] LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

A guardrail that blocks credential access is correct in a customer-service agent and wrong in a DevOps one, yet most static guardrails treat context as fixed at deployment time. LiSA uses conservative policy induction to update guardrail rules from deployment feedback without full retraining, letting the system absorb organizational norms and local privacy expectations that could not have been specified in advance. The approach targets contextual failures such as secret leakage and unauthorized tool calls, where answer-quality errors become concrete deployment harms. Teams shipping agentic systems into varied organizational environments have a concrete adaptation mechanism to evaluate here.
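One way to picture a conservative update rule is sketched below. This is an illustration under stated assumptions, not LiSA's algorithm: the `Guardrail` class, the `min_approvals` threshold, and the "relax slowly, tighten immediately" policy are hypothetical choices made for the example.

```python
from collections import defaultdict


class Guardrail:
    """Default-deny rules keyed by (context, action), updated from feedback."""

    def __init__(self, min_approvals=3):
        self.min_approvals = min_approvals
        self.approvals = defaultdict(int)  # consecutive approvals per rule
        self.allowed = set()

    def is_allowed(self, context, action):
        return (context, action) in self.allowed

    def feedback(self, context, action, approved):
        key = (context, action)
        if not approved:
            # Assumption: a single rejection revokes the rule and resets the count.
            self.allowed.discard(key)
            self.approvals[key] = 0
            return
        self.approvals[key] += 1
        # Assumption: relax only after enough consecutive approvals.
        if self.approvals[key] >= self.min_approvals:
            self.allowed.add(key)


guard = Guardrail(min_approvals=2)
blocked_at_start = guard.is_allowed("devops", "read_credentials")
guard.feedback("devops", "read_credentials", approved=True)
guard.feedback("devops", "read_credentials", approved=True)
devops_allowed = guard.is_allowed("devops", "read_credentials")
support_allowed = guard.is_allowed("customer_service", "read_credentials")
guard.feedback("devops", "read_credentials", approved=False)
devops_after_rejection = guard.is_allowed("devops", "read_credentials")
```

Keying rules by context is what lets the same action stay blocked for the customer-service agent while being permitted for the DevOps one.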

04 [Training] Long Context Pre-Training with Lighthouse Attention

Lighthouse Attention is stripped out entirely before inference, so the deployed model is a standard transformer with no efficiency-vs-compatibility trade-off at serving time. During training, a gradient-free hierarchical selection mechanism wraps ordinary scaled dot-product attention and reduces its quadratic cost at extreme sequence lengths, enabling long-context pre-training that would otherwise be memory-prohibitive. Because the modification lives only in the training graph, teams can adopt it without changing inference infrastructure, quantization pipelines, or serving kernels. That separation of concerns makes it worth evaluating for any pre-training run targeting sequences beyond 32K tokens.
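The blurb does not say how the selection works, so the sketch below uses a generic single-level top-k block selection as a stand-in. It only shows the shape of the idea: a non-differentiable selection step chooses which keys ordinary attention sees during training. The names `select_blocks` and `training_attention`, the mean-key scoring rule, and the toy tensors are all assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sdpa(q, keys, values):
    """Ordinary scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_blocks(q, keys, block_size, top_k):
    """Hypothetical rule: score each block by its mean key, keep the top_k."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(q, mean_key), start))
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in chosen)


def training_attention(q, keys, values, block_size=2, top_k=1):
    """Training-time path: attend only to keys inside the selected blocks."""
    starts = select_blocks(q, keys, block_size, top_k)
    idx = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    return sdpa(q, [keys[i] for i in idx], [values[i] for i in idx]), idx


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [2.0], [3.0], [4.0]]

sparse_out, sparse_idx = training_attention(q, keys, values, top_k=1)
full_out, full_idx = training_attention(q, keys, values, top_k=2)
inference_out = sdpa(q, keys, values)  # inference path: plain attention
```

When every block is selected, the wrapped call reduces to plain attention, which is the property that lets the wrapper be dropped at inference without changing the model's interface.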

05 [Inference] RouteProfile: Elucidating the Design Space of LLM Profiles for Routing

Profile design choices, specifically which benchmarks and aggregation methods represent model capabilities, can swing routing accuracy by more than the router architecture itself, yet most routing research holds profile design constant. RouteProfile systematically maps this design space, disentangling profile construction from router mechanism so the two can be evaluated and improved independently. The finding reframes where engineering effort pays off: optimizing the router on a poorly constructed profile is building on sand. Teams operating multi-model serving stacks should audit their capability profiles before investing further in router mechanism design.
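A toy example shows how profile construction alone can change routing while the router stays fixed. Everything here is hypothetical: the model names, the made-up benchmark scores, and the `build_profile` and `route` helpers are illustrations, not the paper's setup or results.

```python
from statistics import mean

# Made-up per-benchmark scores grouped by capability, for illustration only.
SCORES = {
    "model_a": {"math": [0.9, 0.2, 0.2], "code": [0.6, 0.6, 0.6]},
    "model_b": {"math": [0.5, 0.5, 0.5], "code": [0.7, 0.5, 0.5]},
}


def build_profile(scores, aggregate):
    """Profile construction: collapse benchmark scores into one number per capability."""
    return {
        model: {cap: aggregate(vals) for cap, vals in caps.items()}
        for model, caps in scores.items()
    }


def route(profile, capability):
    """Router mechanism (held fixed): pick the model with the highest profile score."""
    return max(profile, key=lambda model: profile[model][capability])


mean_profile = build_profile(SCORES, mean)
max_profile = build_profile(SCORES, max)

math_by_mean = route(mean_profile, "math")
math_by_max = route(max_profile, "math")
code_by_mean = route(mean_profile, "code")
code_by_max = route(max_profile, "code")
```

With these toy numbers, switching the aggregation from mean to max flips the routing decision for both capabilities even though `route` never changes, which is why the two pieces are worth evaluating separately.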

06 [Open-source] FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Competitive programming benchmarks have a hard ceiling because every problem has a known correct answer, which means training on them does not prepare models for real-world engineering tasks where correctness is not binary. FrontierSmith iteratively evolves open-ended problems from existing closed-ended tasks, generating a large-scale corpus where no optimal solution exists and quality must be judged by other means. LLMs trained on this data show measurable gains on practical engineering tasks that closed-ended training data fails to cover. For teams building code-generation models, this points to a data construction strategy rather than a model architecture change.