From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

Zihan Niu; Wenping Hu; Junmin Chen; Xiyue Wang; Tong Xu; Ruiming Tang

arXiv:2601.13995 · cs.CL · January 21, 2026

Open Access

TL;DR

This paper introduces Tree-aware Aligned Global Sampling (TAGS), a data selection framework for LLM instruction tuning that builds a knowledge tree from fine-grained tags and samples over it to jointly control quality, diversity, and target alignment.

Contribution

The paper proposes a tree-aware sampling method that uses a hierarchical knowledge tree to guide data selection for LLM instruction tuning, and reports that it outperforms approaches based on flat embedding clusters or coarse tags.

Findings

01. TAGS outperforms state-of-the-art baselines in data efficiency.

02. TAGS achieves a +5.84% performance gain using only 5% of the data.

03. Aligned sampling raises performance by a further +4.24%.

Abstract

Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up…
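The abstract is cut off before it describes how the tree is built or how sampling is scored, so the following is only an illustrative sketch of the general idea of sampling over a tag hierarchy, not the TAGS algorithm. It assumes each instance already carries a coarse-to-fine tag path and a quality score; the field names `id`, `path`, and `quality`, the function `tree_sample`, and the even budget split are all hypothetical choices made for this example.

```python
from collections import defaultdict

def tree_sample(items, budget, depth=0):
    """Pick up to `budget` items by splitting the budget across tag subtrees.

    Illustrative stand-in only: `path` and `quality` are assumed fields,
    and the even split is not taken from the paper.
    """
    # Leaf case: no deeper tag to split on, so keep the highest-quality items.
    if all(len(it["path"]) <= depth for it in items):
        return sorted(items, key=lambda it: it["quality"], reverse=True)[:budget]

    # Group items by their tag at this depth (one group per child node).
    children = defaultdict(list)
    for it in items:
        tag = it["path"][depth] if depth < len(it["path"]) else "<here>"
        children[tag].append(it)

    # Split the budget evenly; leftover slots go to the larger subtrees first.
    order = sorted(children, key=lambda t: (-len(children[t]), t))
    base, extra = divmod(budget, len(order))
    picked = []
    for i, tag in enumerate(order):
        share = base + (1 if i < extra else 0)
        picked += tree_sample(children[tag], share, depth + 1)
    return picked

# Toy data: six instances under two coarse tags and four fine-grained tags.
items = [
    {"id": "a", "path": ("math", "algebra"), "quality": 0.9},
    {"id": "b", "path": ("math", "algebra"), "quality": 0.4},
    {"id": "c", "path": ("math", "geometry"), "quality": 0.7},
    {"id": "d", "path": ("code", "python"), "quality": 0.8},
    {"id": "e", "path": ("code", "python"), "quality": 0.6},
    {"id": "f", "path": ("code", "sql"), "quality": 0.5},
]
selected = tree_sample(items, budget=4)
print(sorted(it["id"] for it in selected))
```

With a budget of 4, the sketch takes one instance from each fine-grained tag rather than the four highest-scoring instances overall, which is the kind of diversity control a flat quality ranking cannot express. A subtree with fewer instances than its share leaves part of the budget unused, so the function may return fewer than `budget` items.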

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification