LM-mixup: Text Data Augmentation via Language Model based Mixup
Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei

TL;DR
This paper introduces LM-Mixup, a novel data augmentation method that distills and enhances low-quality instruction data for large language models, improving their instruction-following capabilities efficiently.
Contribution
It presents a new dataset construction pipeline, MIXTURE, and a reinforcement learning-based augmentation technique, LM-Mixup, to effectively utilize low-quality data for instruction tuning.
Findings
LM-Mixup improves model performance using only 3% of data.
Distilled data surpasses full dataset training in benchmarks.
Low-quality data can be valuable when properly distilled and augmented.
Abstract
Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by…
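As a rough illustration of the Instruction Distillation task described in the abstract, the sketch below models one training example as a cluster of imperfect instruction-output pairs plus a single high-quality target pair. The class names, field names, and prompt wording are hypothetical and chosen only for illustration; they are not taken from the paper or from the MIXTURE dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionPair:
    instruction: str
    output: str


@dataclass
class DistillationExample:
    # Several low-quality or redundant pairs on the same topic (the input cluster).
    cluster: List[InstructionPair]
    # The single high-quality pair the cluster should be distilled into (the target).
    target: InstructionPair


def build_distillation_prompt(cluster: List[InstructionPair]) -> str:
    """Format a cluster as one text prompt for a hypothetical distillation model."""
    lines = [
        "Merge the following imperfect samples into one "
        "high-quality instruction-output pair."
    ]
    for i, pair in enumerate(cluster, start=1):
        lines.append(f"[Sample {i}] Instruction: {pair.instruction}")
        lines.append(f"[Sample {i}] Output: {pair.output}")
    return "\n".join(lines)


# Toy data, invented for this sketch only.
example = DistillationExample(
    cluster=[
        InstructionPair("what is photosynthesis", "plants make food"),
        InstructionPair("explain photosynthesis pls", "it uses sunlight"),
    ],
    target=InstructionPair(
        "Explain what photosynthesis is.",
        "Photosynthesis is the process by which plants use sunlight, water, "
        "and carbon dioxide to produce glucose and oxygen.",
    ),
)
prompt = build_distillation_prompt(example.cluster)
```

In this framing, the cluster is the model input and the target is the supervision signal; how clusters are actually formed and how the distiller is trained are left to the paper.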
Peer Reviews
Decision: Submitted to ICLR 2026
Practical Value: Demonstrates that low-quality data can be effectively repurposed to reduce dataset size and cost.
Comprehensive Experiments: Ablation studies and multiple benchmark evaluations strengthen empirical claims.
Reproducibility: Detailed training pipelines and dataset construction steps are provided.
Limited Academic Depth: The method feels like a well-engineered application of existing tools rather than a theoretical or algorithmic advance.
Weak Motivation: The paper does not sufficiently articulate why this particular form of "instruction distillation" is needed or how it differs philosophically from prior data augmentation or curation work.
Presentation Flaws: Inconsistent referencing and occasionally overly technical writing reduce readability and scholarly rigor.
1. The task studied is meaningful and significant, and it would be great to have a high-quality dataset for the task.
2. It's novel to train a model to do this specific task.
1. I believe the task itself is not novel, and there are papers, e.g., https://arxiv.org/pdf/2503.00034, that have discussed this problem. Yet these papers are not mentioned in the related work.
2. I do not fully understand the experiment setting. Please see question 1. This can potentially weaken the results of the experiments.
3. Although the authors have already explained it in section 5, LLM rating bias is still a valid concern. Specifically, we see in the experiments that 50% low quality + 50%
- The paper formalizes the Instruction Distillation task, clearly articulating the goal of transforming multiple imperfect inputs into a single high-quality instruction-output pair.
- Extensive experiments demonstrate consistent improvements across both in-domain (MIXTURE) and out-of-domain (OpenLLM leaderboard) benchmarks.
1. Limited Practical Applicability. The proposed setting assumes the availability of a large proportion of low-quality or redundant data, which may not reflect many real-world instruction-tuning scenarios where most data are already of moderate or acceptable quality. Consequently, the practical scope of Instruction Distillation could be narrower than implied, and its benefits may diminish when applied to more balanced or higher-quality datasets.
2. The paper’s exposition is at times vague and d
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
