Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, and Ninghao Liu

TL;DR
This paper introduces a data diversity metric defined in feature space and a synthesis method that improves large language model performance by generating diverse, task-relevant data; the approach applies across multiple model families.
Contribution
The paper proposes Feature Activation Coverage (FAC), an interpretable diversity metric, and FAC Synthesis, a framework that improves data diversity and model performance through feature-aware data generation.
Findings
Improved downstream task performance with synthetic data
Shared feature space enables cross-model knowledge transfer
Enhanced data diversity leads to better instruction following and toxicity detection
Abstract
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks,…
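The abstract does not give the exact formula for FAC, so the following is only a minimal sketch of one plausible reading: coverage is taken to be the fraction of task-relevant sparse-autoencoder features that at least one sample in the dataset activates, and the "missing" features are those no sample activates. The function name `feature_activation_coverage`, the `threshold` parameter, and the toy activation values are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def feature_activation_coverage(activations, task_features, threshold=0.0):
    """Hypothetical coverage score over a set of task-relevant features.

    activations: (num_samples, num_features) array of sparse-autoencoder
        feature activations, one row per data sample (assumed layout).
    task_features: indices of the features considered task-relevant.
    threshold: activation level above which a feature counts as active
        (an assumed hyperparameter).
    Returns (coverage, missing), where missing lists uncovered features.
    """
    acts = np.asarray(activations, dtype=float)
    task_features = np.asarray(task_features, dtype=int)
    # A feature is covered if any sample activates it above the threshold.
    active = (acts[:, task_features] > threshold).any(axis=0)
    missing = task_features[~active]
    coverage = active.sum() / task_features.size if task_features.size else 0.0
    return float(coverage), missing

# Toy seed dataset: 2 samples, 4 features (values are made up).
seed_acts = np.array([
    [0.9, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.7],
])
task_features = [0, 1, 3]

coverage, missing = feature_activation_coverage(seed_acts, task_features)

# A synthetic sample that activates the missing feature closes the gap.
synthetic = np.array([[0.0, 0.5, 0.0, 0.0]])
new_coverage, new_missing = feature_activation_coverage(
    np.vstack([seed_acts, synthetic]), task_features
)
```

In this toy case the seed data covers two of the three task-relevant features, and feature 1 is reported as missing. Under the same assumptions, a synthesis step would target such missing features when generating new samples, which is what the appended synthetic row stands in for.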
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
