Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Mingqi Wu; Zhihao Zhang; Qiaole Dong; Zhiheng Xi; Jun Zhao; Senjie Jin; Xiaoran Fan; Yuhao Zhou; Huijie Lv; Ming Zhang; Yanwei Fu; Qin Liu; Songyang Zhang; Qi Zhang

arXiv:2507.10532 · cs.LG · December 18, 2025

TL;DR

This paper investigates the reliability of reinforcement learning results in large language models, revealing that data contamination in benchmarks can lead to misleading conclusions, and introduces a clean dataset for accurate evaluation.

Contribution

The authors identify data contamination issues in popular benchmarks and propose a new leakage-free dataset, RandomCalculation, for trustworthy evaluation of RL in mathematical reasoning.
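
As a rough illustration of why randomly generated arithmetic cannot have leaked into a pre-training corpus, the sketch below builds problems on the fly from a seeded random number generator. It is not the authors' generator: the function names (`random_expression`, `make_dataset`), the operator set, and the operand range are all assumptions made for this example.

```python
import random

# Hypothetical operator set; the real dataset may use different operations.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def random_expression(rng, n_steps, max_operand=100):
    """Build a left-nested expression with n_steps operations.

    Returns (text, value), where value is the exact integer result,
    computed alongside the text so no parsing is needed.
    """
    value = rng.randint(1, max_operand)
    text = str(value)
    for _ in range(n_steps):
        op = rng.choice(sorted(OPS))          # sorted for a stable choice order
        operand = rng.randint(1, max_operand)
        text = f"({text} {op} {operand})"
        value = OPS[op](value, operand)
    return text, value

def make_dataset(n_problems, n_steps, seed=0):
    """Generate n_problems fresh (text, value) pairs from a seeded RNG."""
    rng = random.Random(seed)
    return [random_expression(rng, n_steps) for _ in range(n_problems)]
```

Raising `n_steps` lengthens the expressions, which is one simple way to scale difficulty in a sketch like this; each problem carries its own ground-truth answer for scoring.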

Findings

1. Contaminated benchmarks can produce unreliable RL results.

2. Accurate reward signals improve model performance beyond baseline.

3. Random rewards do not enhance mathematical reasoning performance.
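
The contrast between an accurate and a random reward signal can be made concrete with a small sketch. The two functions below are hypothetical and are not taken from the paper's code: the first assumes integer answers and scores the model's answer against the ground truth, while the second ignores the answer and returns a coin flip.

```python
import random

def accurate_reward(predicted: str, ground_truth: int) -> float:
    """Return 1.0 if the predicted text parses to the ground-truth integer, else 0.0."""
    try:
        return 1.0 if int(predicted.strip()) == ground_truth else 0.0
    except ValueError:
        # Unparseable answers receive no reward.
        return 0.0

def random_reward(rng: random.Random, p: float = 0.5) -> float:
    """Return 1.0 with probability p, independent of the model's answer."""
    return 1.0 if rng.random() < p else 0.0
```

Because `random_reward` carries no information about correctness, any benchmark gain obtained while training on it has to come from somewhere other than the reward, which is why such gains raise suspicion of contamination.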

Abstract

Reasoning in large language models has long been a central research focus, and recent studies employing reinforcement learning (RL) have introduced diverse methods that yield substantial performance gains with minimal or even no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance performance. However, these breakthroughs are predominantly observed for the mathematically strong Qwen2.5 series on benchmarks such as MATH-500, AMC, and AIME, and seldom transfer to models like Llama, which warrants a more in-depth investigation. In this work, our empirical analysis reveals that pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. Consequently, conclusions derived from contaminated benchmarks on Qwen2.5 series may be unreliable. To obtain trustworthy evaluation…

Code & Models

Repositories

wumingqi/LLM-Math-Evaluation (official)

Videos

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination · underline
