Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Kazuki Fujii; Yukito Tajima; Sakae Mizuki; Masaki Kawamura; Hinari Shimada; Taihei Shiotani; Koshiro Saito; Masanari Oi; Taishi Nakamura; Takumi Okamoto; Shigeki Ishida; Kakeru Hattori; Youmi Ma; Hiroya Takamura; Rio Yokota; Jun Sakuma; Naoaki Okazaki

arXiv:2505.02881·cs.LG·March 3, 2026

TL;DR

This paper demonstrates that systematically rewriting pre-training data with a novel pipeline significantly improves large language models' performance in math and code tasks, surpassing previous baselines.

Contribution

Introduces two openly licensed datasets, SwallowCode and SwallowMath, with a transform-and-retain pipeline that enhances data quality for LLM pre-training, leading to substantial performance gains.

Findings

01

SwallowCode improves code generation metrics by over 16 points.

02

SwallowMath increases accuracy on GSM8K and MATH benchmarks.

03

Pipeline ablations show each stage contributes to overall performance gains.

Abstract

The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances…
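The first two stages named in the abstract (syntax validation, then a style-score filter) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_style_score` is a hypothetical stand-in that does not call pylint, and the `min_score` threshold of 7.0 is an illustrative value that does not come from this excerpt. The two LLM rewriting stages are not shown.

```python
from typing import Callable, Iterable, List


def is_valid_python(source: str) -> bool:
    """Stage 1 (sketch): keep only snippets the Python compiler can parse."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers sources with null bytes on older Python versions.
        return False


def toy_style_score(source: str) -> float:
    """Hypothetical stand-in for a lint score on a 0-10 scale.

    This is NOT pylint: it only penalizes lines longer than 100 characters.
    """
    lines = source.splitlines() or [""]
    long_lines = sum(1 for line in lines if len(line) > 100)
    return 10.0 * (1.0 - long_lines / len(lines))


def filter_snippets(
    snippets: Iterable[str],
    score_fn: Callable[[str], float] = toy_style_score,
    min_score: float = 7.0,  # assumed threshold, for illustration only
) -> List[str]:
    """Stages 1-2 (sketch): drop unparsable snippets, then low-scoring ones."""
    kept = []
    for source in snippets:
        if not is_valid_python(source):
            continue
        if score_fn(source) < min_score:
            continue
        kept.append(source)
    return kept


if __name__ == "__main__":
    samples = [
        "def f(x):\n    return x + 1\n",   # valid and short lines: kept
        "def g(:\n    pass\n",             # syntax error: dropped
        "x = '" + "a" * 200 + "'\n",       # one overlong line: dropped
    ]
    print(len(filter_snippets(samples)))
```

A real pipeline would pass a scorer backed by an actual linter as `score_fn`; the function signature here is kept pluggable for that reason.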

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The method of rewriting is well-motivated with sufficient ablation studies of different data-filtering methods. The improvements are substantial and consistent across multiple benchmarks (HumanEval, HumanEval+, GSM8K, MATH), with comprehensive ablation studies demonstrating that each pipeline stage contributes incrementally. The paper is well-written and easy to follow.

Weaknesses

The novelty is somewhat limited. The paper reads more as an engineering effort that improves pre-training performance through careful experimental design than as a work introducing new ideas. The authors should clarify the paper's unique contribution and insight for the academic community.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

* This paper releases two open source datasets under permissive licenses, supporting community reuse and extension.
* The ablation studies isolate the impact of each pipeline stage, demonstrating clear and interpretable improvements.

Weaknesses

* The success and ceiling of this method are fundamentally limited by the capabilities and potential biases of the LLM used for rewriting. Rewriting relies on Llama-3.3-70B-Instruct, which may introduce its own stylistic or semantic biases.
* While the paper claims rewriting enforces "algorithmic efficiency," it is not clear how the paper quantifies or measures this specific gain. Providing concrete metrics or examples of complexity improvements would strengthen this claim.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- Thoughtful analysis of why synthetic-from-scratch may underperform due to diversity issues.
- Decontamination checks and cross-model validation (Qwen2-7B) improve credibility.
- Open release of data, prompts, and checkpoints enhances community value and reproducibility.

Weaknesses

- Generality claims are suggestive but not demonstrated across languages or larger scales.
- No analysis of downstream generalization outside HumanEval/+.
- No quantitative quality checks on rewritten outputs (e.g., compile/run rate, test pass rate, semantic drift), which risks introducing hallucinated correctness.

Code & Models

Repositories

rioyokotalab/swallow-code-math
Official

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Digital Rights Management and Security · Natural Language Processing Techniques

Methods: LLaMA