Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

TL;DR
For reasoning language models under a fixed update budget, supervised fine-tuning for many epochs on a small dataset can outperform a single epoch on a much larger one, challenging the usual assumption that more unique training samples yield better generalization.
Contribution
It shows that repeating data during long-CoT supervised fine-tuning improves reasoning performance under a fixed update budget, and that training token accuracy indicates when further repetition stops helping.
Findings
Under a fixed update budget, multi-epoch training on small datasets outperforms single-epoch training on larger datasets.
Training token accuracy signals when repetition has saturated, and therefore when to stop.
Gains from additional epochs plateau once the training data is fully memorized.
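The token-accuracy finding can be illustrated with a minimal sketch. It assumes that "training token accuracy" means the fraction of supervised target tokens whose predicted token matches the label; the `IGNORE` marker, the function names, and the 0.99 threshold are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Sequence

# Assumed marker for positions excluded from scoring (e.g. prompt tokens).
IGNORE = -100


def token_accuracy(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of scored positions where the predicted token equals the label."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    scored = [(p, y) for p, y in zip(predicted, labels) if y != IGNORE]
    if not scored:
        raise ValueError("no scored positions")
    return sum(p == y for p, y in scored) / len(scored)


def has_saturated(accuracy_per_epoch: Sequence[float], threshold: float = 0.99) -> bool:
    """True once the latest epoch's training token accuracy reaches the threshold.

    The threshold is a hypothetical stand-in for "full memorization".
    """
    return len(accuracy_per_epoch) > 0 and accuracy_per_epoch[-1] >= threshold
```

In a training loop, a check like `has_saturated` would be evaluated after each epoch, and repetition would stop once it returns `True`.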
Abstract
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT,…
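The fixed update budget in the abstract can be checked with simple arithmetic. The sketch below assumes the budget is counted in training-sample presentations (dataset size times epochs), which corresponds to equal numbers of gradient updates only if the batch size is held constant; the function name is hypothetical.

```python
def epochs_for_budget(total_sample_passes: int, dataset_size: int) -> int:
    """Number of epochs that spends the whole budget on a dataset of the given size."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if total_sample_passes % dataset_size != 0:
        raise ValueError("budget is not a whole number of epochs for this dataset size")
    return total_sample_passes // dataset_size


# Both configurations from the abstract consume the same budget.
budget = 51200
for dataset_size in (51200, 400):
    epochs = epochs_for_budget(budget, dataset_size)
    print(f"{dataset_size} samples x {epochs} epochs = {dataset_size * epochs} sample passes")
```

Under this accounting, 400 samples for 128 epochs and 51200 samples for 1 epoch both amount to 51200 sample passes, which is why the abstract calls them equivalent.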
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
