Reformulation for Pretraining Data Augmentation

Xintong Hao; Ruijie Zhu; Ge Zhang; Ke Shen; Chenggang Li

arXiv:2502.04235·cs.CL·May 20, 2025



TL;DR

This paper introduces MGA, a scalable data augmentation method that reformulates existing text corpora to reduce repetition and improve large language model training efficiency, demonstrated on a 770 billion token dataset.

Contribution

We propose MGA, a novel reformulation technique for data augmentation that mitigates repetition issues and enhances large language model scaling, supported by the creation of the MGACorpus.

Findings

01 · MGA outperforms data repetition and upsampling methods in scaling scenarios.

02 · The MGACorpus contains 770 billion tokens for training.

03 · Prompt engineering influences generation quality and evaluation metrics.

Abstract

Despite the impressive capabilities of large language models across various tasks, their continued scaling is severely hampered not only by data scarcity but also by the performance degradation associated with excessive data repetition during training. To overcome this critical bottleneck, we propose the Massive Genre-Audience (MGA) reformulation method, a lightweight and scalable data augmentation technique inspired by synthetic data methodologies. MGA systematically reformulates existing corpora into diverse, contextually rich variations to mitigate the negative effects of repetition, and we introduce this approach along with the resulting 770 billion token MGACorpus in this work. We experimentally validate its core benefit by demonstrating superior performance against data repetition and upsampling in scaling scenarios (up to 13B parameters). Furthermore, comprehensive analysis…
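As a rough illustration of the idea described in the abstract, the sketch below expands one source document into several reformulation prompts, one per (genre, audience) pair. Everything in it is an assumption made for illustration: the names `build_prompts` and `reformulate`, the prompt wording, the example genres and audiences, and the `generate` callable standing in for an LLM are hypothetical and are not taken from the paper, whose actual prompts and pipeline are not shown on this page.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ReformulationPrompt:
    """One (genre, audience) rewrite request for a single source document."""
    genre: str
    audience: str
    prompt: str


def build_prompts(
    document: str,
    genres: Sequence[str],
    audiences: Sequence[str],
) -> List[ReformulationPrompt]:
    """Create one prompt per (genre, audience) pair.

    The prompt template is a hypothetical placeholder, not the paper's.
    """
    prompts = []
    for genre, audience in product(genres, audiences):
        text = (
            f"Rewrite the following text as a {genre} for {audience}. "
            "Preserve the facts; change only style and framing.\n\n"
            f"{document}"
        )
        prompts.append(ReformulationPrompt(genre, audience, text))
    return prompts


def reformulate(
    document: str,
    genres: Sequence[str],
    audiences: Sequence[str],
    generate: Callable[[str], str],
) -> List[str]:
    """Return one generated variation per (genre, audience) pair.

    `generate` is any text-in, text-out function (an assumed LLM wrapper).
    """
    return [generate(p.prompt) for p in build_prompts(document, genres, audiences)]


if __name__ == "__main__":
    # Stub generator so the sketch runs without a model.
    variations = reformulate(
        "Water boils at 100 degrees Celsius at sea level.",
        genres=["dialogue", "tutorial"],
        audiences=["children", "engineers"],
        generate=lambda prompt: f"[generated from {len(prompt)} prompt chars]",
    )
    print(len(variations))
```

Under these assumptions, a single document yields `len(genres) * len(audiences)` variations, which is the sense in which reformulation can stand in for repeating the same text across epochs.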

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Most of the experiments in this paper are done very thoroughly, and results are presented in a very neat and orderly manner. I appreciate that the authors release an open-source dataset and also release tooling and other artifacts that make this work reproducible.

Weaknesses

A key thing I wanted to know in this paper was whether the authors' proposed approach, MGA, outperforms alternative approaches for generating synthetic data. Even though the authors say that “MGA is not in competition with but is complementary to other synthetic data methodologies” (line 370), the authors should properly demonstrate MGA’s utility in the current landscape of synthetic data generation strategies. I am not sure if I got a clear answer to this question from the current version of th

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The greatest strength is the scaling plots, which demonstrate consistent performance improvement in the token-matched regime. Really nice! Figure 5 was particularly great. The training noise seems relatively low, which is good for trusting the downstream evals.
2. Really nice overall benchmark numbers. They ground the work and help situate it relative to other methods. Achieving an MMLU score of 40.7 is great for the 1.7B model class.
3. Beating finding more hq data is really

Weaknesses

My main gripe is that I would have liked to see a few more synthetic data/rephrasing baselines. WRAP, for example, would have been nice to show. I understand that this would have been expensive at the higher token counts, but at the lower token counts, it would have been nice. There are great synthetic pretraining data techniques. Could you add WRAP, or other synthetic post-training datasets (Nemotron, for example)? That would be great.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper is well organized and clear. Specific research questions are laid out and experiments are directly designed to support each one in a clear manner.
2. Comprehensive results across many datasets and models at 5 different scales.
3. The paper emphasizes and shows the synergistic nature of their method.
4. Authors state they will release MGACorpus and all artifacts for reproducibility.
5. The motivation of addressing negative influences of data repetition during training is important.

Weaknesses

1. The method is based on prompt engineering, and as such there are some potential issues with bias introduced by using a solely LLM-based system. For example, when the LLM is used for judging, what sorts of bias are being introduced? Was some form of human evaluation done, or some alternative test, to determine whether the scores produced by the LLM align with the actual quality of the text? Furthermore, there should be some evidence that actually diverse outputs are being generated by the method. I b


Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices