TL;DR
Corpus2Skill distills a document corpus offline into a hierarchical skill directory that an LLM agent navigates at serve time, improving answer quality and grounding on an enterprise customer-support benchmark; how much it helps depends on how the corpus is structured.
Contribution
This work introduces Corpus2Skill, a method to distill document corpora into hierarchical skill directories for improved navigation and knowledge grounding in LLM-based QA systems.
Findings
Navigation improves answer quality on an enterprise customer-support benchmark.
Corpus2Skill improves grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines, at a moderate cost tradeoff.
Effectiveness depends on corpus structure and topical taxonomy.
Abstract
Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff. A ten-subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single-domain corpora with a recoverable…
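The navigation loop described in the abstract (drill down through summaries, read documents, backtrack from an unproductive branch) can be illustrated with a minimal sketch. This is not the authors' implementation: `SkillNode`, `keyword_overlap`, `navigate`, and the toy tree are hypothetical names introduced here, and a keyword-overlap score stands in for the LLM agent's choice of which branch to open next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SkillNode:
    """One directory in a hypothetical skill hierarchy (assumed structure)."""
    summary: str
    children: List["SkillNode"] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)


def keyword_overlap(query: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase words.

    Stand-in for the agent's judgment; a real agent would read the summary.
    """
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(
    node: SkillNode,
    query: str,
    is_answer: Callable[[str], bool],
    trail: Optional[List[str]] = None,
) -> Optional[str]:
    """Depth-first descent with backtracking.

    Returns the first document accepted by `is_answer`, or None when the
    subtree under `node` is unproductive. `trail` records visited summaries.
    """
    if trail is not None:
        trail.append(node.summary)
    # Check documents stored at this level first.
    for doc in node.documents:
        if is_answer(doc):
            return doc
    # Open the most promising child first; move to the next one on failure.
    ranked = sorted(
        node.children,
        key=lambda child: keyword_overlap(query, child.summary),
        reverse=True,
    )
    for child in ranked:
        found = navigate(child, query, is_answer, trail)
        if found is not None:
            return found
    return None  # nothing here: the caller backtracks


# Toy corpus: the best-scoring branch does not contain the answer.
root = SkillNode(
    "support knowledge base",
    children=[
        SkillNode(
            "password and login help",
            documents=["Login requires an email address."],
        ),
        SkillNode(
            "account settings",
            documents=["To reset a password, open Settings > Security."],
        ),
    ],
)

trail: List[str] = []
result = navigate(
    root,
    "how to reset password",
    is_answer=lambda doc: "reset" in doc.lower(),
    trail=trail,
)
```

In this toy run the agent first opens "password and login help" because it shares a word with the query, finds nothing useful, backtracks, and then finds the answer under "account settings". The recorded `trail` makes the visited path explicit, which is the kind of corpus visibility that single-shot retrieval does not give the model.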
