Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Zhongzhi Li, Xuansheng Wu, Yijiang Li, Lijie Hu, and Ninghao Liu

arXiv:2602.10388 · cs.CL · February 16, 2026

TL;DR

This paper introduces a diversity metric computed in the feature space of large language models, together with a synthesis method that generates diverse, task-relevant data. Training on this data improves model performance, and the approach applies across multiple model families.

Contribution

The paper proposes Feature Activation Coverage (FAC), an interpretable diversity metric, and FAC Synthesis, a framework that improves data diversity and model performance through feature-aware data generation.

Findings

01 Improved downstream task performance with synthetic data

02 Shared feature space enables cross-model knowledge transfer

03 Enhanced data diversity leads to better instruction following and toxicity detection

Abstract

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks,…
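The abstract does not give the formal definition of FAC, so the following is only a rough sketch of the idea under a stated assumption: coverage is taken to be the fraction of sparse-autoencoder features that fire above a threshold on at least one sample, and "missing" features are those that fire on a reference set but never on the seed set. The function names, the threshold, and the toy activation values are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set

Activations = Sequence[Sequence[float]]  # rows = samples, columns = SAE features


def active_features(acts: Activations, threshold: float = 0.0) -> Set[int]:
    """Indices of features that exceed `threshold` on at least one sample."""
    active: Set[int] = set()
    for row in acts:
        for j, value in enumerate(row):
            if value > threshold:
                active.add(j)
    return active


def coverage(acts: Activations, num_features: int, threshold: float = 0.0) -> float:
    """Assumed coverage score: share of all features activated by the dataset."""
    return len(active_features(acts, threshold)) / num_features


def missing_features(seed: Activations, reference: Activations,
                     threshold: float = 0.0) -> List[int]:
    """Features active on the reference set but never on the seed set."""
    return sorted(active_features(reference, threshold)
                  - active_features(seed, threshold))


# Toy activations over 4 hypothetical features (values are made up).
seed = [[0.9, 0.0, 0.0, 0.0],
        [0.0, 0.4, 0.0, 0.0]]
reference = [[0.2, 0.0, 0.7, 0.0],
             [0.0, 0.1, 0.0, 0.0]]

print(coverage(seed, num_features=4))      # 0.5
print(missing_features(seed, reference))   # [2]

# A synthetic sample written to express feature 2 raises coverage.
augmented = seed + [[0.0, 0.0, 0.6, 0.0]]
print(coverage(augmented, num_features=4))  # 0.75
```

In this sketch, the list returned by `missing_features` plays the role of the targets that a generator would be prompted to express. How the paper actually selects features and steers generation is not stated on this page.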

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification