SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

Yifan Zhou; Zhentao Zhang; Ziming Cheng; Shuo Zhang; Qizhen Lan; Zhangquan Chen; Zhi Yang; QianyuXu; Ronghao Chen; Huacan Wang; Sen Hu

arXiv:2605.18693 · cs.AI · May 19, 2026

TL;DR

SkillGenBench is a new benchmark designed to evaluate the ability of language models to generate correct, reusable, and executable skills from repositories and documents, addressing a key challenge in building effective LLM agents.

Contribution

It introduces a unified, controlled protocol for assessing skill generation pipelines, covering multiple generation regimes and procedural sources, with standardized evaluation procedures.

Findings

01 Performance varies significantly across different skill-generation methods.
02 Reusability and correctness of generated skills remain challenging.
03 Distinct failure modes are observed between repository-based and document-based skill generation.

Abstract

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is…
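The generate, execute, and assess loop described in the abstract can be pictured with a small toy sketch. This is not the benchmark's actual code or artifact format: `SkillArtifact`, `toy_generator`, `run_in_harness`, and `evaluate` are hypothetical names chosen for illustration, and the pass-rate metric is an assumed stand-in for the paper's evaluation procedures.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SkillArtifact:
    """Hypothetical standardized skill: a name plus executable source."""
    name: str
    source: str  # Python source that is expected to define `run(x)`


# A generator maps a raw corpus (repository files or documents) to a skill.
Generator = Callable[[str], SkillArtifact]


def toy_generator(corpus: str) -> SkillArtifact:
    """Stand-in for a real LLM pipeline; it ignores the corpus entirely."""
    return SkillArtifact(name="double", source="def run(x):\n    return 2 * x\n")


def run_in_harness(skill: SkillArtifact, test_input: int) -> Any:
    """Fixed harness: load the artifact in a fresh namespace and call `run`."""
    namespace: Dict[str, Any] = {}
    exec(skill.source, namespace)  # toy only; a real harness would sandbox this
    return namespace["run"](test_input)


def evaluate(generator: Generator, corpus: str, cases: List[Tuple[int, int]]) -> float:
    """Assumed metric: fraction of cases where the skill output matches."""
    skill = generator(corpus)
    passed = 0
    for test_input, expected in cases:
        try:
            if run_in_harness(skill, test_input) == expected:
                passed += 1
        except Exception:
            pass  # an artifact that fails to execute counts as a failed case
    return passed / len(cases)


# Two of the three cases match the doubling skill; the third does not.
score = evaluate(toy_generator, "README: doubles a number", [(1, 2), (3, 6), (0, 1)])
print(score)
```

Keeping the harness and the test cases fixed while swapping only the generator is what lets different pipelines be compared on equal footing, which is the kind of isolation the abstract describes.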
