Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki

TL;DR
This paper demonstrates that systematically rewriting pre-training data with a novel pipeline significantly improves large language models' performance in math and code tasks, surpassing previous baselines.
Contribution
Introduces two openly licensed datasets, SwallowCode and SwallowMath, with a transform-and-retain pipeline that enhances data quality for LLM pre-training, leading to substantial performance gains.
Findings
SwallowCode improves code generation metrics by over 16 points.
SwallowMath increases accuracy on GSM8K and MATH benchmarks.
Pipeline ablations show each stage contributes to overall performance gains.
Abstract
The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode (16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMath (2.3 billion tokens) enhances…
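The two filtering stages named in the abstract (syntax validation, then a pylint-based style filter) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `lint_score` callable, and the `min_score` threshold of 7.0 are all assumptions made for illustration. In a real pipeline `lint_score` would wrap a linter such as pylint; here it is passed in so the example stays self-contained.

```python
import ast
from typing import Callable, Iterable, List


def passes_syntax_check(source: str) -> bool:
    """Return True if the snippet parses as valid Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as null bytes on older Python versions.
        return False
    return True


def filter_snippets(
    snippets: Iterable[str],
    lint_score: Callable[[str], float],  # hypothetical scorer, e.g. a pylint wrapper
    min_score: float = 7.0,              # assumed threshold, not taken from the source
) -> List[str]:
    """Keep snippets that parse and whose lint score meets the threshold."""
    kept = []
    for source in snippets:
        if not passes_syntax_check(source):
            continue  # stage 1: drop code that does not parse
        if lint_score(source) < min_score:
            continue  # stage 2: drop code below the style threshold
        kept.append(source)
    return kept


if __name__ == "__main__":
    samples = [
        "def add(a, b):\n    return a + b\n",
        "def broken(:\n    pass\n",
    ]
    # A constant scorer stands in for a real linter in this demonstration.
    survivors = filter_snippets(samples, lint_score=lambda _: 10.0)
    print(len(survivors))  # prints 1
```

In the pipeline the abstract describes, snippets that survive these filters then go through the two LLM rewriting stages; that step is outside the scope of this sketch.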
Peer Reviews
Decision: ICLR 2026 Poster
The method of rewriting is well-motivated with sufficient ablation studies of different data-filtering methods. The improvements are substantial and consistent across multiple benchmarks (HumanEval, HumanEval+, GSM8K, MATH), with comprehensive ablation studies demonstrating that each pipeline stage contributes incrementally. The paper is well-written and easy to follow.
The novelty is somewhat limited. The paper reads more as engineering work that improves pre-training performance through careful experimental design than as a source of new ideas. The authors should clarify the paper's unique contribution and insight for the academic community.
* This paper releases two open-source datasets under permissive licenses, supporting community reuse and extension.
* The ablation studies isolate the impact of each pipeline stage, demonstrating clear and interpretable improvements.
* The success and ceiling of this method are fundamentally limited by the capabilities and potential biases of the powerful LLM. Rewriting relies on Llama-3.3-70B-Instruct, which may introduce its own stylistic or semantic biases.
* While the paper claims rewriting enforces "algorithmic efficiency," it is not clear how the paper quantifies or measures this specific gain. Providing concrete metrics or examples of complexity improvements would strengthen this claim.
- Thoughtful analysis of why synthetic-from-scratch may underperform due to diversity issues.
- Decontamination checks and cross-model validation (Qwen2-7B) improve credibility.
- Open release of data, prompts, and checkpoints enhances community value and reproducibility.
- Generality claims are suggestive but not demonstrated across languages or larger scales.
- No analysis of downstream generalization outside HumanEval/+.
- No quantitative quality checks on rewritten outputs (e.g., compile/run rate, test pass rate, semantic drift), which risks introducing hallucinated correctness.
Code & Models
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500model· 60 dl60 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0005000model· 1 dl1 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0007500model· 1 dl1 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0010000model· 17 dl17 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp1-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0012500model· 4 dl4 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp2-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500model
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp3-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0002500model· 1 dl1 dl
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp2-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0005000model
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp3-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0005000model
- 🤗tokyotech-llm/Llama-3.1-8B-code-ablation-exp2-LR2.5e-5-MINLR2.5E-6-WD0.1-iter0007500model· 2 dl2 dl
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Digital Rights Management and Security · Natural Language Processing Techniques
Methods: LLaMA
