SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li; Wenbo Chen; Yimin Liu; Shenghan Zheng; Xiaokun Chen; Yifeng He; Yubo Li; Bingran You; Haotian Shen; Jiankai Sun; Shuyi Wang; Binxu Li; Qunhong Zeng; Di Wang; Xuandong Zhao; Yuanli Wang; Roey Ben Chaim; Zonglin Di; Yipeng Gao; Junwei He; Yizhuo He; Liqiang Jing; Luyang Kong; Xin Lan; Jiachen Li; Songlin Li; Yijiang Li; Yueqian Lin; Xinyi Liu; Xuanqing Liu; Haoran Lyu; Ze Ma; Bowei Wang; Runhui Wang; Tianyu Wang; Wengao Ye; Yue Zhang; Hanwen Xing; Yiqi Xue; Steven Dillmann; and Han-chung Lee

arXiv:2602.12670 · cs.AI · March 16, 2026

TL;DR

SkillsBench measures how much Agent Skills help across diverse tasks. Curated Skills raise the average pass rate by 16.2 percentage points, while self-generated Skills provide no benefit on average.

Contribution

Introduces SkillsBench, a standardized benchmark with curated skills and verifiers to measure skill impact on agent performance across multiple domains.

Findings

01

Curated Skills increase pass rates by 16.2 percentage points on average.

02

Effects of Skills vary widely by domain, from +4.5pp (Software Engineering) to +51.9pp (Healthcare).

03

Self-generated Skills provide no benefit on average, indicating that models cannot reliably author the procedural knowledge they benefit from consuming.

Abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive…
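
The per-domain deltas quoted in the abstract are differences in pass rate between two conditions, expressed in percentage points. The sketch below shows one way such a delta could be computed. It is illustrative only: the record layout, the condition labels (`no_skills`, `curated`), the function names, and the toy data are all assumptions made for this example, not the benchmark's actual schema, code, or results.

```python
from collections import defaultdict


def pass_rate(outcomes):
    """Percentage of trajectories that passed (outcomes is a list of bools)."""
    return 100.0 * sum(outcomes) / len(outcomes)


def delta_pp_by_domain(records, baseline="no_skills", treatment="curated"):
    """Pass-rate difference (treatment minus baseline) per domain, in pp.

    `records` is an iterable of (domain, condition, passed) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for domain, condition, passed in records:
        grouped[domain][condition].append(bool(passed))

    # Only report domains that have trajectories under both conditions.
    return {
        domain: pass_rate(conds[treatment]) - pass_rate(conds[baseline])
        for domain, conds in grouped.items()
        if conds.get(baseline) and conds.get(treatment)
    }


# Toy data, NOT results from the paper.
toy_records = [
    ("domain_a", "no_skills", True),
    ("domain_a", "no_skills", False),
    ("domain_a", "curated", True),
    ("domain_a", "curated", True),
    ("domain_b", "no_skills", True),
    ("domain_b", "no_skills", True),
    ("domain_b", "curated", True),
    ("domain_b", "curated", False),
]

deltas = delta_pp_by_domain(toy_records)
print(deltas)
```

In this toy data, `domain_a` gains 50pp and `domain_b` loses 50pp, which mirrors how a positive average can coexist with negative deltas on individual tasks or domains.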

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Topic Modeling · Artificial Intelligence in Healthcare and Education