The Impact of Post-training on Data Contamination

Muhammed Yusuf Kocyigit; Caglar Yildirim

arXiv:2601.06103·cs.LG·January 13, 2026

TL;DR

This study investigates how dataset contamination introduced during pre-training interacts with later post-training. Contamination inflates benchmark scores, and the size and spread of that inflation depend on continued pre-training, the post-training method (SFT or GRPO), and model scale.

Contribution

It provides a controlled analysis of contamination effects in large language models and compares the impacts of supervised fine-tuning and reinforcement learning post-training methods.

Findings

01

Contamination causes performance spikes that diminish with continued pre-training.

02

SFT inflates scores only on contaminated tasks, while GRPO inflates on both contaminated and uncontaminated tasks.

03

Larger models memorize more and translate leakage into more general capabilities.
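The findings above all rest on comparing a contaminated model with a clean one at the same training stage. A minimal sketch of that comparison follows, assuming accuracy is reported in percentage points. The names `Score`, `inflation`, and `summarize` are hypothetical, and the numbers are placeholders chosen only to exercise the code, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Score:
    """Accuracy (in percentage points) of a matched pair of models."""
    clean: float         # model trained without leaked test items
    contaminated: float  # model trained with leaked test items


def inflation(score: Score) -> float:
    """Accuracy gap attributable to contamination, in percentage points."""
    return round(score.contaminated - score.clean, 2)


def summarize(scores: Dict[Tuple[str, str], Score]) -> Dict[Tuple[str, str], float]:
    """Map each (stage, task) pair to its inflation."""
    return {key: inflation(value) for key, value in scores.items()}


# Placeholder values for illustration only; they are NOT taken from the paper.
example = {
    ("pretrain", "leaked_task"): Score(clean=30.0, contaminated=38.0),
    ("sft", "leaked_task"): Score(clean=45.0, contaminated=49.0),
    ("sft", "held_out_task"): Score(clean=40.0, contaminated=40.0),
}

for (stage, task), gap in summarize(example).items():
    print(f"{stage:8s} {task:14s} inflation = {gap:+.2f} points")
```

Reporting the gap per stage and per task is what lets a reader separate inflation confined to leaked tasks from inflation that also appears on tasks that were never leaked.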

Abstract

We present a controlled study of how dataset contamination interacts with the post-training stages now standard in large language model training pipelines. Starting from clean checkpoints of Qwen2.5 (0.5B/1.5B) and Gemma3 (1B/4B), we inject five copies of GSM8K and MBPP test items into the first 2B tokens of an otherwise 25B token extended pre-training dataset. We then compare the contaminated and clean models both immediately after pre-training and again after two popular post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL) with group relative policy optimization (GRPO). The applied post-training steps do not have any contamination. Across math and coding benchmarks, we find three consistent patterns: (i) Contamination causes performance spikes that are gradually diminished with continued pre-training. After even 25B tokens the apparent performance…
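The injection setup described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how leaked items are positioned inside the 2B-token window, so the sketch shuffles them uniformly among the leading documents. The function names and the whitespace token counter are hypothetical stand-ins.

```python
import random
from typing import List, Sequence


def count_tokens(text: str) -> int:
    """Whitespace token count; an assumed stand-in for a real tokenizer."""
    return len(text.split())


def inject_contamination(
    corpus: Sequence[str],
    test_items: Sequence[str],
    copies: int = 5,
    window_tokens: int = 2_000_000_000,
    seed: int = 0,
) -> List[str]:
    """Mix `copies` copies of each test item into the leading window of the corpus.

    Documents after the window are returned unchanged and in their original order.
    """
    rng = random.Random(seed)

    # Count how many leading documents are needed to cover `window_tokens`.
    n_window, seen = 0, 0
    for doc in corpus:
        if seen >= window_tokens:
            break
        seen += count_tokens(doc)
        n_window += 1

    # Repeat every leaked item `copies` times.
    leaked = [item for item in test_items for _ in range(copies)]

    # Assumption: uniform placement inside the window via a seeded shuffle.
    head = list(corpus[:n_window]) + leaked
    rng.shuffle(head)
    return head + list(corpus[n_window:])


# Toy usage: 10 four-token documents, a 20-token window, two leaked items.
toy_corpus = [f"doc {i} filler text" for i in range(10)]
mixed = inject_contamination(toy_corpus, ["Q1", "Q2"], copies=5, window_tokens=20)
print(len(mixed), mixed.count("Q1"), mixed.count("Q2"))
```

Because the leaked copies add tokens of their own, the window in this sketch ends up slightly longer than `window_tokens`; a real pipeline would need to decide whether to account for that.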

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Realistic setting for data contamination research. This paper uses a pretraining -> SFT/RL setting, which is so far the closest to what could actually happen in model training. Previous work in data contamination research overwhelmingly focuses on SFT on the test set, where it is unsurprising that the contamination is obvious. Although the models and corpus are small, this is an important step in the right direction.
2. Insightful findings. This paper finds out tha

Weaknesses

1. The main claims of the paper rely on a small performance gap, around 2-4% across the experiments. Although these are smaller models of limited capacity, the gap still makes me question the generalizability of this paper's findings.
2. The difference between SFT and GRPO is a major contribution of the paper, but more depth (or a hypothesised mechanism) would strengthen the claim. For example, are RL-tuned models less “local-overfit” to contaminated items because the reward encourages broader pattern

Reviewer 02 · Rating 2 · Confidence 5

Strengths

S1- The topic is important to the community. A better understanding of the dynamics of data contamination at each stage of the model lifecycle is crucial to improving generalization.
S2- The setup is easy to understand and the results are clearly described.

Weaknesses

The main concern I have with the paper is that the study's scope is somewhat narrow and reads like an MVP. For example:
- It is somewhat common knowledge that larger models can generalize better even when data contamination is present. The models studied in the paper are quite small, and the findings may only be valid at this size. While compute constraints are common these days, scaling to, say, the Olmo family of models (7b, 12b, 32b) might be more informative than staying in the 1-4B range.

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Given the ubiquity of LLM post-training, studying dataset contamination under this more practical scenario is useful and has more real-world value in determining whether dataset contamination is impactful.
2. The paper presents a clear research goal, the experiments support the conclusions, and the results are clear. The presence of error bars adds rigor to the results.

Weaknesses

1. The contribution is limited and practical takeaways should be expanded. The paper's main contribution is testing SFT and GRPO on top of contaminated pretrained models, which has limited technical novelty given that it is a minor expansion over previous works studying the pretraining stage, such as (Kocyigit et al., 2025; Jiang et al., 2024). While the conclusion that post-training leads to inflation on contaminated benchmarks is interesting, it retreads that dataset contamination is a major is

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification