Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions
Zijin Hong, Hao Wu, Su Dong, Junnan Dong, Yilin Xiao, Yujing Zhang, Zhu Wang, Feiran Huang, Linyi Li, Hongxia Yang, Xiao Huang

TL;DR
This paper introduces RV-Bench, a new evaluation framework using randomized variable questions to assess the genuine mathematical reasoning capabilities of large language models, revealing their limitations in generalization and robustness.
Contribution
We propose RV-Bench, a novel benchmark with unseen randomized questions to better evaluate LLMs' true mathematical reasoning skills, addressing issues of data contamination and superficial learning.
Findings
LLMs show a proficiency gap between seen and unseen data.
LLMs generalize poorly across similar reasoning tasks.
Test-time scaling can improve LLM performance on RV-Bench.
Abstract
Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's…
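The idea of a question-generating function can be illustrated with a minimal sketch. This is not the authors' implementation: the seed problem, the variable ranges, and the names `RVQ`, `generate_rvq`, and `accuracy` are all hypothetical, chosen only to show how one fixed question pattern can yield many variable combinations, each with a programmatically computed answer.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class RVQ:
    """One random variable question and its ground-truth answer."""
    question: str
    answer: int


def generate_rvq(rng: random.Random) -> RVQ:
    """Hypothetical question-generating function for one seed problem.

    The background story is fixed; only the variables are re-sampled,
    so each call yields a question the model has likely not seen verbatim.
    """
    boxes = rng.randint(2, 12)
    per_box = rng.randint(3, 20)
    # Keep the answer strictly positive by selling fewer apples than exist.
    sold = rng.randint(1, boxes * per_box - 1)
    question = (
        f"A shop has {boxes} boxes with {per_box} apples in each box. "
        f"It sells {sold} apples. How many apples remain?"
    )
    return RVQ(question=question, answer=boxes * per_box - sold)


def accuracy(solver: Callable[[str], int], n: int, seed: int) -> float:
    """Fraction of n freshly sampled RVQs that `solver` answers correctly.

    `solver` stands in for an LLM call plus answer extraction (assumed here).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        rvq = generate_rvq(rng)
        if solver(rvq.question) == rvq.answer:
            correct += 1
    return correct / n
```

Under these assumptions, a model that has only memorized one instance of the seed problem would score poorly across many sampled variable combinations, whereas a model that understands the pattern would stay accurate regardless of the sampled values.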
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
