SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
Hongcheol Cho, Ryangkyung Kang, Youngeun Kim

TL;DR
SkillRet is a large-scale benchmark for evaluating and improving skill retrieval in large language model (LLM) agents. It shows that off-the-shelf retrievers struggle on large skill libraries and that task-specific fine-tuning substantially improves them.
Contribution
Introduces SkillRet, a large-scale, structured skill retrieval benchmark (17,810 public agent skills, 63,259 training samples, and 4,997 evaluation queries) that supports both evaluating and training retrieval models for realistic agent systems.
Findings
Off-the-shelf models perform poorly on large skill libraries.
Task-specific fine-tuning significantly improves retrieval performance.
Fine-tuned models better focus on relevant signals in noisy queries.
Abstract
As LLM agents are increasingly deployed with large libraries of reusable skills, selecting the right skill for a user request has become a critical systems challenge. In small libraries, users may invoke skills explicitly by name, but this assumption breaks down as skill ecosystems grow under tight context and latency budgets. Despite its practical importance, skill retrieval remains underexplored, with limited benchmarks and little understanding of retrieval behavior on realistic skill libraries. To address this gap, we introduce SkillRet, a large-scale benchmark for skill retrieval in LLM agents. SkillRet contains 17,810 public agent skills, organized with structured semantic tags and a two-level taxonomy spanning 6 major categories and 18 sub-categories. It provides 63,259 training samples and 4,997 evaluation queries with disjoint skill pools, enabling both benchmarking and…
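Skill retrieval, as framed above, means ranking a library of skills against a user request and checking whether the correct skill appears near the top. The sketch below illustrates that loop under loud assumptions: the `Skill` record layout, the bag-of-words similarity (a stand-in for a real dense retriever), and the Recall@k metric are all hypothetical illustrations, not SkillRet's actual schema, models, or evaluation protocol, which this excerpt does not specify. The disjointness check mirrors the abstract's statement that training and evaluation use disjoint skill pools.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Skill:
    # Hypothetical record layout; SkillRet's real schema may differ.
    skill_id: str
    description: str
    category: str      # major category (SkillRet uses 6)
    sub_category: str  # sub-category (SkillRet uses 18)
    tags: list = field(default_factory=list)  # structured semantic tags


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a stand-in for a learned embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, skills: list, k: int) -> list:
    # Rank every skill by similarity to the query and keep the top k ids.
    q = embed(query)
    ranked = sorted(
        skills,
        key=lambda s: cosine(q, embed(s.description + " " + " ".join(s.tags))),
        reverse=True,
    )
    return [s.skill_id for s in ranked[:k]]


def recall_at_k(queries: list, skills: list, k: int) -> float:
    # queries: (query_text, gold_skill_id) pairs; assumed metric, for illustration.
    hits = sum(gold in retrieve(q, skills, k) for q, gold in queries)
    return hits / len(queries)


def pools_are_disjoint(train_skills: list, eval_skills: list) -> bool:
    # Training and evaluation should share no skill ids.
    return not ({s.skill_id for s in train_skills} & {s.skill_id for s in eval_skills})


# Toy evaluation pool with made-up skills and queries.
eval_skills = [
    Skill("pdf-merge", "merge several pdf files into one document", "documents", "pdf", ["pdf", "merge"]),
    Skill("csv-plot", "plot a chart from a csv table", "data", "visualization", ["csv", "chart"]),
    Skill("git-commit", "write a git commit message for staged changes", "dev", "vcs", ["git"]),
]
train_skills = [Skill("email-draft", "draft a polite email reply", "office", "email", ["email"])]
queries = [("combine these pdf files", "pdf-merge"), ("make a chart from my csv", "csv-plot")]

print(retrieve("combine these pdf files", eval_skills, k=1))
print(recall_at_k(queries, eval_skills, k=1))
print(pools_are_disjoint(train_skills, eval_skills))
```

On this toy pool the lexical overlap is clean, so both queries retrieve their gold skill at rank 1. The paper's finding that off-the-shelf models perform poorly on large libraries reflects the much harder case of thousands of skills with overlapping descriptions and noisy queries, where such simple matching breaks down.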
