SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Shiqi Chen; Jingze Gai; Ruochen Zhou; Jinghan Zhang; Tongyao Zhu; Junlong Li; Kangrui Wang; Zihan Wang; Zhengyu Chen; Klara Kaleb; Ning Miao; Siyang Gao; Cong Lu; Manling Li; Junxian He; Yee Whye Teh

arXiv:2603.00718·cs.CL·March 11, 2026

Open Access

TL;DR

SkillCraft is a new benchmark designed to evaluate and improve the ability of large language model agents to learn, compose, and reuse higher-level skills in complex, realistic tool-using scenarios, emphasizing efficiency and skill abstraction.

Contribution

The paper introduces SkillCraft, a benchmark that tests agents' ability to form and reuse compositional skills, along with a lightweight evaluation protocol for auto-assembling and caching skills.
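The idea of assembling atomic tools into a cached, reusable skill can be illustrated with a small sketch. This is not the paper's protocol, whose details are not given here. It assumes a skill is simply an ordered list of atomic tool names stored under a name, and that each tool maps a state dictionary to an updated one. `SkillLibrary`, `save_skill`, `run_skill`, and the toy tools `price_items` and `add_shipping` are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

# An atomic tool maps a state dict to an updated state dict
# (an assumption of this sketch, not something the paper specifies).
Tool = Callable[[dict], dict]


class SkillLibrary:
    """Hypothetical cache of named skills, each an ordered list of tool names."""

    def __init__(self, tools: Dict[str, Tool]) -> None:
        self.tools = tools
        self.skills: Dict[str, List[str]] = {}
        self.atomic_calls = 0  # counts every atomic tool execution

    def save_skill(self, name: str, steps: List[str]) -> None:
        """Store a composition so later tasks can reuse it by name."""
        unknown = [s for s in steps if s not in self.tools]
        if unknown:
            raise KeyError(f"unknown tools: {unknown}")
        self.skills[name] = list(steps)

    def run_skill(self, name: str, state: dict) -> dict:
        """Replay a cached skill: apply its tools in order to the state."""
        for step in self.skills[name]:
            state = self.tools[step](state)
            self.atomic_calls += 1
        return state


# Toy atomic tools (illustrative only; prices are integer cents).
def price_items(state: dict) -> dict:
    return {**state, "subtotal_cents": state["qty"] * 250}


def add_shipping(state: dict) -> dict:
    return {**state, "total_cents": state["subtotal_cents"] + 500}


library = SkillLibrary({"price_items": price_items, "add_shipping": add_shipping})
library.save_skill("quote_order", ["price_items", "add_shipping"])

# The same skill is reused across two tasks with a single call each.
first = library.run_skill("quote_order", {"qty": 2})
second = library.run_skill("quote_order", {"qty": 4})
print(first["total_cents"], second["total_cents"])  # 1000 1500
```

In this toy setup the caller issues one `run_skill` call per task instead of re-planning each atomic step. That is one plausible way reuse could reduce the tokens an agent spends, though the sketch makes no claim about how SkillCraft measures it.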

Findings

01 Agents reduce token usage by up to 80% through skill reuse.

02 Success correlates with the ability to compose and reuse tools.

03 Skill abstraction enhances efficiency and performance.

Abstract

Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into…

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Machine Learning in Materials Science · Machine Learning and Data Classification