How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Yujian Liu; Jiabao Ji; Li An; Tommi Jaakkola; Yang Zhang; Shiyu Chang

arXiv:2604.04323·cs.CL·April 7, 2026

TL;DR

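This paper benchmarks how much agentic skills help LLM-based agents in realistic settings, where agents must retrieve skills from a collection of 34k real-world skills instead of receiving hand-crafted ones. It finds that the gains are fragile and proposes refinement strategies to recover them.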

Contribution

The paper presents the first comprehensive study of skill utility in realistic scenarios, documents the limitations of skill usage there, and proposes retrieval and refinement methods to improve performance.

Findings

01. Performance gains from skills decline in realistic settings.

02. Query-specific refinement significantly recovers lost performance.

03. Retrieval and refinement improve pass rates on Terminal-Bench 2.0.

Abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted, narrowly tailored skills for each task. In many realistic settings, however, the LLM agent may have to search for and select relevant skills on its own, and even the closest matching skills may not be well tailored to the task. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile:…
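To make the retrieval setting concrete, here is a minimal sketch of an agent picking candidate skills for a query from a skill collection. Everything in it is an assumption for illustration: the `Skill` record, the `retrieve_skills` helper, the sample skills, and the use of token-overlap (Jaccard) scoring are hypothetical and are not taken from the paper or its repository, which may use a different retriever.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    # Hypothetical record type; the paper's actual skill format is not shown here.
    name: str
    description: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_skills(query: str, skills: list[Skill], k: int = 3) -> list[Skill]:
    """Return the k skills whose descriptions best overlap the query.

    Lexical overlap is a stand-in chosen for illustration, not the
    retrieval method evaluated in the paper.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        skills,
        key=lambda s: jaccard(query_tokens, tokenize(s.description)),
        reverse=True,
    )
    return ranked[:k]


# Made-up skills standing in for a much larger real-world collection.
skills = [
    Skill("git-bisect", "find the commit that introduced a bug with git bisect"),
    Skill("csv-clean", "clean and deduplicate rows in a csv file"),
    Skill("docker-debug", "debug a failing docker container build"),
]
top = retrieve_skills("why does my docker build keep failing", skills, k=1)
print(top[0].name)
```

Even in this toy form, the sketch shows why the realistic setting is harder than the idealized one: the top-ranked skill is only the closest lexical match, and nothing guarantees it is well tailored to the task at hand.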

Code & Models

Repositories

UCSB-NLP-Chang/Skill-Usage (GitHub)
