SkillOS: The Curation Bottleneck That Keeps LLM Agents Stuck at Zero

SkillOS uses reinforcement learning to train a dedicated skill curator that filters and evolves reusable agent experience, beating both memory-free and memory-based baselines on multi-turn agentic tasks and single-turn reasoning tasks.

Most deployed LLM agents treat every session as a clean slate. The reasoning behind that design is sound enough: storing past experience is easy, but knowing which experience is worth keeping is hard. SkillOS takes the hard part seriously and finds it is harder than most skill-memory systems acknowledge.

Standard approaches to agent memory hand a human the curation job, apply fixed heuristics to decide what gets stored, or train on short task windows where the benefit of a skill shows up immediately. None of those paths teaches an agent to reason about which skills will pay off several tasks later. That long-horizon signal is exactly what's missing, and it's the reason skill repositories in production tend to fill with low-quality distillations that hurt retrieval more than they help.

SkillOS splits the agent into two components with distinct roles. A frozen executor retrieves skills from an external repository and applies them to the task at hand. A separate, trainable curator decides what gets written to that repository, how existing entries get updated, and when something should be pruned. The curator is the only part that learns. Think of it as separating the worker from the librarian: the worker executes, the librarian maintains the collection so future workers find what they actually need.
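To make the division of labor concrete, here is a minimal sketch of the worker-and-librarian split. Everything in it is an assumption for illustration: the class names (`SkillRepository`, `FrozenExecutor`, `Curator`), the word-overlap retrieval, and the rule-based `curate` method are hypothetical stand-ins, not the SkillOS implementation. The one property it mirrors from the description above is access control: the executor only reads the repository, and the curator is the only component that writes, updates, or prunes.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRepository:
    """External skill store shared by both components (hypothetical layout)."""
    skills: dict[str, str] = field(default_factory=dict)  # name -> Markdown body

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy retrieval (assumption): rank skills by word overlap with the query.
        words = set(query.lower().split())
        scored = [
            (len(words & set(body.lower().split())), name)
            for name, body in self.skills.items()
        ]
        ranked = sorted(scored, reverse=True)
        return [name for score, name in ranked[:k] if score > 0]


class FrozenExecutor:
    """Read-only consumer of the repository; it is never trained."""

    def run(self, task: str, repo: SkillRepository) -> dict:
        used = repo.retrieve(task)
        # Placeholder for an LLM rollout conditioned on the retrieved skills.
        return {"task": task, "skills_used": used, "trajectory": f"solved: {task}"}


class Curator:
    """The only component with write access to the repository."""

    def curate(self, result: dict, repo: SkillRepository) -> str:
        # Stand-in rule (assumption); a learned policy would pick the
        # operation and write the skill content itself.
        name = result["task"].split()[0].lower()
        if name in repo.skills:
            repo.skills[name] += "\n- refined after: " + result["task"]
            return "update"
        repo.skills[name] = f"# {name}\n- learned from: {result['task']}"
        return "add"

    def prune(self, repo: SkillRepository, name: str) -> bool:
        # Returns True if a skill was actually removed.
        return repo.skills.pop(name, None) is not None


repo = SkillRepository()
executor, curator = FrozenExecutor(), Curator()
for task in ["parse csv file", "parse json file"]:
    result = executor.run(task, repo)
    print(curator.curate(result, repo), result["skills_used"])
```

The first task finds an empty repository and triggers an add; the second retrieves the stored skill and triggers an update. Swapping the rule inside `curate` for a trained policy leaves the executor untouched, which is the point of the split.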

Training the curator requires a signal that reflects long-term curation quality, not just immediate task performance. SkillOS handles this with composite rewards and a grouped task stream. Tasks are ordered by skill-relevant dependency: earlier tasks in a group update the repository, and later related tasks evaluate whether those updates were useful. The reward flows backward to the curator based on how well the downstream tasks went. This is the mechanism that prior work lacked. Short-horizon training never sees that delayed feedback; SkillOS is built around it.
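The delayed-credit idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the training procedure from the paper: `run_group`, `toy_execute`, and `toy_curate` are hypothetical names, the repository is a plain dict, and the reward uses a single downstream-success term because the components of the composite reward are not spelled out here. What it does show is the ordering: each curation step is scored by the outcomes of the tasks that come after it in the same group.

```python
from statistics import mean


def run_group(tasks, curate, execute):
    """Run one group of related tasks; return a delayed reward per curation step.

    tasks:   ordered task descriptions (earlier ones feed later ones)
    curate:  callable(repo, task, success) -> None, mutates the repository
    execute: callable(repo, task) -> float in [0, 1], the task outcome
    """
    repo = {}
    outcomes = []
    for task in tasks:
        success = execute(repo, task)  # executor sees the repo as it stands now
        outcomes.append(success)
        curate(repo, task, success)    # curator edits the repo after the task

    rewards = []
    for i in range(len(tasks)):
        later = outcomes[i + 1:]
        # Curation after task i is credited with how the later tasks went.
        # The last step has no downstream tasks, so it gets 0.0 here.
        rewards.append(mean(later) if later else 0.0)
    return rewards


def toy_execute(repo, task):
    # Toy outcome (assumption): succeed only if a skill for the topic exists.
    return 1.0 if task.split(":")[0] in repo else 0.0


def toy_curate(repo, task, success):
    # Toy curator (assumption): always store a note under the task's topic.
    repo[task.split(":")[0]] = "notes from " + task


def noop_curate(repo, task, success):
    # Baseline curator that never writes anything.
    return None


group = ["sql: join", "sql: group by", "sql: window"]
rewards = run_group(group, toy_curate, toy_execute)
baseline = run_group(group, noop_curate, toy_execute)
print(rewards)   # [1.0, 1.0, 0.0]
print(baseline)  # [0.0, 0.0, 0.0]
```

The first task fails in both runs because the repository starts empty. A short-horizon reward that looked only at that first outcome would not separate the two curators; the difference appears only once the later tasks in the group are scored.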

Across multi-turn agentic benchmarks and single-turn reasoning tasks, SkillOS beats both memory-free baselines and strong memory-based systems on effectiveness and efficiency. The learned curator generalizes across different executor backbones and task domains, meaning the curation policy transfers even when the underlying model changes. Analysis of the repository contents shows the skills themselves evolve over time into richer, more structured Markdown files encoding higher-level meta-skills rather than flat procedural traces. For teams building agents that need to improve across sessions rather than just within them, the takeaway is direct: the curation layer deserves as much engineering attention as the retrieval layer.

We're thinking: We read SkillOS as a direct challenge to the way most agent memory work is evaluated. The field has optimized for skill storage and retrieval accuracy, treating curation as a preprocessing step. SkillOS shows that when curation becomes a first-class learning problem with delayed reward signals, the quality of what ends up in the repository improves enough to matter on downstream tasks. The implication for teams is uncomfortable: if your current skill memory system was designed around heuristic filters or human review, you may be accumulating a repository that degrades agent performance at scale, and you won't see it until the task horizon is long enough to expose it.

Key takeaways:

SkillOS separates a frozen executor from a trainable skill curator, using grouped task streams with delayed composite rewards to teach long-horizon curation policy rather than short-window heuristics.
The learned curator generalizes across executor backbones and task domains, and repository contents evolve into structured meta-skill files over time; caveat: results are on benchmark task streams, and real production distributions may require additional domain-specific reward shaping.
Teams building multi-session agents with skill or memory repositories should treat curation as a trainable policy, not a fixed filter, and evaluate it on task streams long enough to surface delayed feedback.

Source: SkillOS: Learning Skill Curation for Self-Evolving Agents