From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning
Zihan Niu, Wenping Hu, Junmin Chen, Xiyue Wang, Tong Xu, Ruiming Tang

TL;DR
This paper introduces TAGS, a hierarchical, tree-based data sampling framework for LLM instruction tuning that improves data efficiency and alignment by leveraging fine-grained knowledge structures.
Contribution
The paper proposes a tree-aware sampling method that organizes fine-grained tags into a hierarchical knowledge tree to guide data selection for LLM instruction tuning, outperforming approaches built on flat embedding clusters or coarse tags.
Findings
TAGS outperforms state-of-the-art baselines in data efficiency.
Achieves a +5.84% performance gain using only 5% of the data.
Aligned sampling improves performance by a further +4.24%.
Abstract
Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up…
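The abstract is cut off before it describes how the tree is built or sampled, so the following is only a minimal sketch of the general idea of sampling over a tag hierarchy, not the TAGS procedure itself. It assumes each sample already carries a root-to-leaf tag path and a scalar quality score; the field names `path` and `quality` and the helpers `build_tree`, `ranked_stream`, and `select` are hypothetical. The budget is spread round-robin across sibling branches (a simple stand-in for diversity control), and within each node the higher-quality samples are taken first.

```python
from itertools import islice

# Toy data: "path" and "quality" are assumed fields, not from the paper.
samples = [
    {"path": ("math", "algebra"), "quality": 0.9},
    {"path": ("math", "algebra"), "quality": 0.5},
    {"path": ("math", "geometry"), "quality": 0.7},
    {"path": ("code", "python"), "quality": 0.6},
    {"path": ("code", "python"), "quality": 0.8},
]

def build_tree(samples):
    """Place each sample index at the node named by its tag path."""
    root = {"children": {}, "items": []}
    for idx, sample in enumerate(samples):
        node = root
        for tag in sample["path"]:
            node = node["children"].setdefault(tag, {"children": {}, "items": []})
        node["items"].append(idx)
    return root

def ranked_stream(node, samples):
    """Yield sample indices under `node`: the node's own items by descending
    quality, then its child branches interleaved round-robin."""
    for idx in sorted(node["items"], key=lambda i: samples[i]["quality"], reverse=True):
        yield idx
    # Sort children by tag name so the order is deterministic.
    streams = [ranked_stream(child, samples)
               for _, child in sorted(node["children"].items())]
    while streams:
        alive = []
        for stream in streams:
            idx = next(stream, None)
            if idx is not None:  # index 0 is valid, so compare against None
                yield idx
                alive.append(stream)
        streams = alive

def select(samples, budget):
    """Return up to `budget` sample indices, balanced across the tree."""
    return list(islice(ranked_stream(build_tree(samples), samples), budget))
```

Calling `select(samples, 3)` on the toy data draws from both top-level branches before returning to either one, so a branch with many samples cannot crowd out a smaller branch. A flat top-k by quality would not give that guarantee.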
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
