SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Shanshan Zhong; Yi Lu; Jingjie Ning; Yibing Wan; Lihan Feng; Yuyi Ao; Leonardo F. R. Ribeiro; Markus Dreyer; Sean Ammirati; Chenyan Xiong

arXiv:2604.20087 · cs.CL · April 23, 2026

TL;DR

SkillLearnBench is a new benchmark for evaluating continual skill learning methods for real-world tasks, revealing that current techniques improve over no-skill baselines but face challenges in generalization and scaling.

Contribution

The paper introduces SkillLearnBench, the first comprehensive benchmark for continual skill learning, and evaluates various methods, highlighting their strengths and limitations across diverse real-world tasks.

Findings

01

All continual learning methods outperform the no-skill baseline.

02

Scaling to stronger LLMs does not reliably improve skills.

03

External feedback facilitates genuine improvement, whereas self-feedback can cause drift.

Abstract

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome. Using this benchmark, we evaluate recent continual learning techniques, namely those leveraging one-shot, self/teacher feedback, and skill creator to generate skills from agent experiences. We find that all continual learning methods improve over the no-skill baseline, yet consistent gains remain elusive: no method leads across all tasks and LLMs, and scaling to stronger LLMs does not reliably…
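The two headline comparisons in the abstract (whether each method improves over the no-skill baseline, and whether any single method leads on every task) can be sketched as below. This is a minimal illustration, not the benchmark's code: the score table, the task names `task_a` to `task_c`, and the helper functions are all hypothetical, and the numbers are made-up placeholders rather than results from the paper.

```python
from statistics import mean

# Made-up placeholder scores, NOT results from the paper.
# Layout (an assumption): method -> task -> task-outcome score in [0, 1].
SCORES = {
    "no_skill":         {"task_a": 0.20, "task_b": 0.30, "task_c": 0.10},
    "one_shot":         {"task_a": 0.50, "task_b": 0.35, "task_c": 0.40},
    "self_feedback":    {"task_a": 0.45, "task_b": 0.60, "task_c": 0.30},
    "teacher_feedback": {"task_a": 0.55, "task_b": 0.50, "task_c": 0.45},
}
BASELINE = "no_skill"


def mean_score(method: str) -> float:
    """Average a method's score over all tasks."""
    return mean(SCORES[method].values())


def beats_baseline(method: str) -> bool:
    """True if the method's mean score exceeds the no-skill baseline's."""
    return mean_score(method) > mean_score(BASELINE)


def best_method_per_task() -> dict[str, str]:
    """For each task, name the highest-scoring non-baseline method."""
    methods = [m for m in SCORES if m != BASELINE]
    return {
        task: max(methods, key=lambda m: SCORES[m][task])
        for task in SCORES[BASELINE]
    }


def has_universal_leader() -> bool:
    """True only if one method is the best on every task."""
    return len(set(best_method_per_task().values())) == 1


if __name__ == "__main__":
    for method in SCORES:
        if method != BASELINE:
            print(method, "beats baseline:", beats_baseline(method))
    print("per-task leaders:", best_method_per_task())
    print("single leader across tasks:", has_universal_leader())
```

With these placeholder numbers every method beats the baseline on average while the per-task leader changes from task to task, which is the pattern the abstract describes. A fuller version would repeat the check per LLM and per evaluation level (skill quality, execution trajectory, task outcome).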

Code & Models

Repositories

cxcscmu/SkillLearnBench (GitHub)
