EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
Shengbo Wang, Mingwei Liu, Zike Li, Anji Li, Yanlin Wang, Xin Peng, Zibin Zheng

TL;DR
EvolMathEval introduces an evolutionary framework to generate increasingly difficult mathematical reasoning benchmarks, revealing limitations of LLMs and exposing their reliance on superficial cues rather than genuine reasoning.
Contribution
This work presents a novel automated framework for evolving mathematical benchmarks, significantly increasing problem complexity and exposing weaknesses in LLM reasoning capabilities.
Findings
EvolMathEval generates high-difficulty problems through self-iteration.
Evolving public datasets such as GSM8K reduces model accuracy by an average of 48%.
Most LLM errors are due to reliance on simplistic cues, termed 'Pseudo Aha Moments'.
Abstract
The rapid advancement of Large Language Models (LLMs) poses a significant challenge to existing mathematical reasoning benchmarks. However, these benchmarks tend to become easier over time as LLMs can learn from the published benchmarks. This limitation hinders the precise evaluation of the true capabilities of SOTA models. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. Experimental results demonstrate that EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but it can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved problems, LLMs tend to bypass complex multi-step…
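The abstract describes benchmark evolution only at a high level, so the following is a minimal, purely illustrative sketch of what an evolutionary loop over math problems might look like. Everything in it is an assumption made for illustration and is not taken from the paper: the `Problem` representation, the `mutate` and `crossover` operators, and the `difficulty` proxy (step count) are hypothetical stand-ins for whatever operators and fitness measure EvolMathEval actually uses.

```python
import random
from dataclasses import dataclass


# Hypothetical toy representation: a start value plus a chain of
# arithmetic steps. This is NOT the paper's problem format.
@dataclass(frozen=True)
class Problem:
    start: int
    steps: tuple  # tuple of (op, operand) pairs, op in {"+", "-", "*"}

    def answer(self) -> int:
        """Apply every step in order to obtain the reference answer."""
        value = self.start
        for op, operand in self.steps:
            if op == "+":
                value += operand
            elif op == "-":
                value -= operand
            else:
                value *= operand
        return value


def difficulty(p: Problem) -> int:
    """Assumed difficulty proxy: the number of reasoning steps."""
    return len(p.steps)


def mutate(p: Problem, rng: random.Random) -> Problem:
    """Append one random step (an illustrative mutation operator)."""
    step = (rng.choice("+-*"), rng.randint(2, 9))
    return Problem(p.start, p.steps + (step,))


def crossover(a: Problem, b: Problem, rng: random.Random) -> Problem:
    """Splice a prefix of a's steps onto a suffix of b's steps."""
    cut_a = rng.randint(0, len(a.steps))
    cut_b = rng.randint(0, len(b.steps))
    return Problem(a.start, a.steps[:cut_a] + b.steps[cut_b:])


def evolve(seeds, generations=5, population_size=8, seed=0):
    """Iteratively vary the population and keep the hardest candidates."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        children = [mutate(p, rng) for p in population]
        if len(population) >= 2:
            children += [
                crossover(*rng.sample(population, 2), rng) for _ in population
            ]
        # Selection: retain the candidates with the highest difficulty score.
        ranked = sorted(population + children, key=difficulty, reverse=True)
        population = ranked[:population_size]
    return population


seeds = [Problem(3, (("+", 4),)), Problem(5, (("*", 2),))]
evolved = evolve(seeds)
hardest = evolved[0]
print(difficulty(hardest), hardest.answer())
```

In this toy version the selection step simply prefers longer step chains. A real framework would presumably score candidates by how often models fail on them, which this sketch does not attempt.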
Taxonomy
Topics: Evolutionary Algorithms and Applications · Reinforcement Learning in Robotics · Machine Learning and Data Classification
