LM-mixup: Text Data Augmentation via Language Model based Mixup

Zhijie Deng; Zhouan Shen; Ling Li; Yao Zhou; Zhaowei Zhu; Yanji He; Wei Wang; Jiaheng Wei

arXiv:2510.20449·cs.CL·October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces LM-Mixup, a novel data augmentation method that distills and enhances low-quality instruction data for large language models, improving their instruction-following capabilities efficiently.

Contribution

It presents a new dataset construction pipeline, MIXTURE, and a reinforcement learning-based augmentation technique, LM-Mixup, to effectively utilize low-quality data for instruction tuning.

Findings

01. LM-Mixup improves model performance using only 3% of data.

02. Distilled data surpasses full dataset training in benchmarks.

03. Low-quality data can be valuable when properly distilled and augmented.

Abstract

Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by…
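The Instruction Distillation task defined in the abstract maps a cluster of imperfect instruction-output pairs to one high-quality pair. The snippet below is a minimal sketch of how such a sample might be represented and rendered as a prompt. The class names, field names, and prompt wording are hypothetical illustrations, not the schema of MIXTURE or the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstructionPair:
    """One instruction with its response."""
    instruction: str
    output: str


@dataclass
class DistillationExample:
    """Hypothetical sample layout: several imperfect pairs, one distilled target."""
    cluster: List[InstructionPair]
    target: InstructionPair


def build_prompt(cluster: List[InstructionPair]) -> str:
    """Render a cluster as one text prompt for a distillation model.

    The header wording is an assumption made for illustration only.
    """
    header = (
        "Merge the following imperfect pairs into one "
        "high-quality instruction-output pair."
    )
    parts = []
    for i, pair in enumerate(cluster, start=1):
        parts.append(
            f"[{i}] Instruction: {pair.instruction}\n[{i}] Output: {pair.output}"
        )
    return header + "\n\n" + "\n\n".join(parts)
```

Under these assumptions, a model trained for the task would take `build_prompt(example.cluster)` as input and be supervised toward `example.target`.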

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Practical Value: Demonstrates that low-quality data can be effectively repurposed to reduce dataset size and cost.
- Comprehensive Experiments: Ablation studies and multiple benchmark evaluations strengthen empirical claims.
- Reproducibility: Detailed training pipelines and dataset construction steps are provided.

Weaknesses

- Limited Academic Depth: The method feels like a well-engineered application of existing tools rather than a theoretical or algorithmic advance.
- Weak Motivation: The paper does not sufficiently articulate why this particular form of "instruction distillation" is needed or how it differs philosophically from prior data augmentation or curation work.
- Presentation Flaws: Inconsistent referencing and occasionally overly technical writing reduce readability and scholarly rigor.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The task studied is meaningful and significant, and it would be great to have a high-quality dataset for the task.
2. It's novel to train a model to do this specific task.

Weaknesses

1. I believe the task itself is not novel, and there are papers, e.g., https://arxiv.org/pdf/2503.00034, that have discussed this problem. Yet these papers are not mentioned in the related work.
2. I do not fully understand the experiment setting. Please see question 1. This can potentially weaken the results of the experiments.
3. Although the authors have already explained it in section 5, LLM rating bias is still a valid concern. Specifically, we see in the experiments that 50% low quality + 50%

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper formalizes the Instruction Distillation task, clearly articulating the goal of transforming multiple imperfect inputs into a single high-quality instruction-output pair.
- Extensive experiments demonstrate consistent improvements across both in-domain (MIXTURE) and out-of-domain (OpenLLM leaderboard) benchmarks.

Weaknesses

1. Limited Practical Applicability. The proposed setting assumes the availability of a large proportion of low-quality or redundant data, which may not reflect many real-world instruction-tuning scenarios where most data are already of moderate or acceptable quality. Consequently, the practical scope of Instruction Distillation could be narrower than implied, and its benefits may diminish when applied to more balanced or higher-quality datasets.
2. The paper’s exposition is at times vague and d

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification