Evaluation is All You Need: Strategic Overclaiming of LLM Reasoning Capabilities Through Evaluation Design

Lin Sun; Weihong Lin; Jinzhu Wu; Yongfu Zhu; Xiaoqi Jian; Guangxiang Zhao; Change Jia; Linglin Zhang; Sai-er Hu; Yuhan Wu; Xiangzheng Zhang

arXiv:2506.04734·cs.AI·June 16, 2025

TL;DR

This paper highlights how evaluation design significantly influences perceived reasoning capabilities of LLMs, revealing fluctuations and reproducibility issues in benchmark results, and advocates for more rigorous evaluation standards.

Contribution

The paper provides empirical assessments of the Deepseek-R1-Distill models and argues that improved evaluation paradigms are needed for reported performance to be measured reliably.

Findings

01. Evaluation results vary significantly with different conditions.

02. Reproducibility of performance improvements is challenging.

03. Benchmark assessments are sensitive to evaluation design.

Abstract

Reasoning models represented by the Deepseek-R1-Distill series have been widely adopted by the open-source community due to their strong performance in mathematics, science, programming, and other domains. However, our study reveals that their benchmark evaluation results are subject to significant fluctuations caused by various factors. Subtle differences in evaluation conditions can lead to substantial variations in results. Similar phenomena are observed in other open-source inference models fine-tuned based on the Deepseek-R1-Distill series, as well as in the QwQ-32B model, making their claimed performance improvements difficult to reproduce reliably. Therefore, we advocate for the establishment of a more rigorous paradigm for model performance evaluation and present our empirical assessments of the Deepseek-R1-Distill series models.
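The kind of run-to-run fluctuation the abstract describes can be illustrated with a small simulation. The sketch below is not the authors' evaluation code: `simulate_run` and `summarize` are hypothetical helpers, every answer is a synthetic coin flip, and the benchmark size (30 problems), accuracy (0.5), and seed count (20) are assumptions chosen only for illustration. It shows how a score from a single run on a small benchmark can differ noticeably between seeds, and how the spread can be reported alongside the mean.

```python
import random
import statistics


def simulate_run(n_problems, samples_per_problem, p_correct, seed):
    """Simulate one evaluation run and return its mean accuracy.

    Every answer is a synthetic coin flip with probability p_correct;
    no real model or benchmark is involved.
    """
    rng = random.Random(seed)
    total = n_problems * samples_per_problem
    correct = sum(rng.random() < p_correct for _ in range(total))
    return correct / total


def summarize(scores):
    """Report the spread of scores across repeated runs."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }


if __name__ == "__main__":
    # Hypothetical setup: 30 problems, per-answer accuracy 0.5, 20 seeds.
    # Compare 1 sample per problem against 16 samples per problem.
    for k in (1, 16):
        scores = [simulate_run(30, k, 0.5, seed) for seed in range(20)]
        print(k, summarize(scores))
```

Reporting the mean together with the spread over several seeds, rather than a single run, is one way a claimed improvement could be made easier to check; whether it matches the paradigm the authors propose is not stated in this summary.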


Taxonomy

Topics: Artificial Intelligence in Law · Digital Rights Management and Security