Benchmarking LLMs' Mathematical Reasoning with Unseen Random Variables Questions

Zijin Hong; Hao Wu; Su Dong; Junnan Dong; Yilin Xiao; Yujing Zhang; Zhu Wang; Feiran Huang; Linyi Li; Hongxia Yang; Xiao Huang

arXiv:2501.11790 · cs.CL · August 14, 2025

TL;DR

This paper introduces RV-Bench, an evaluation framework that uses random variable questions (RVQs) to assess the genuine mathematical reasoning capabilities of large language models, revealing their limitations in generalization and robustness.

Contribution

We propose RV-Bench, a novel benchmark with unseen randomized questions to better evaluate LLMs' true mathematical reasoning skills, addressing issues of data contamination and superficial learning.

Findings

01. LLMs show a proficiency gap between seen and unseen data.

02. LLMs show limited generalization across similar reasoning tasks.

03. Test-time scaling can improve LLM performance on RV-Bench.

Abstract

Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's…
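
To make the idea of a question-generating function concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the arithmetic-sequence template, the variable ranges, and the names `RVQ`, `generate_rvq`, and `accuracy` are all assumptions introduced for this example. The background text stays fixed while the variables are resampled, and the ground-truth answer is computed from the sampled values.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RVQ:
    """One random variable question with its ground-truth answer (hypothetical type)."""
    question: str
    answer: int


def generate_rvq(rng: random.Random) -> RVQ:
    """Hypothetical question-generating function.

    The question background is fixed; only the variables change per call.
    The template and ranges below are illustrative assumptions.
    """
    first = rng.randint(1, 20)   # first term of the sequence
    step = rng.randint(2, 9)     # common difference
    n = rng.randint(5, 15)       # number of terms to sum
    question = (
        f"An arithmetic sequence starts at {first} and increases by {step} "
        f"each term. What is the sum of its first {n} terms?"
    )
    # Closed form n * (2a + (n - 1)d) / 2; the numerator is always even.
    answer = n * (2 * first + (n - 1) * step) // 2
    return RVQ(question, answer)


def accuracy(model_answer: Callable[[str], int], n_questions: int, seed: int = 0) -> float:
    """Fraction of freshly sampled RVQs that `model_answer` gets exactly right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_questions):
        rvq = generate_rvq(rng)
        if model_answer(rvq.question) == rvq.answer:
            correct += 1
    return correct / n_questions
```

Here `model_answer` stands in for whatever callable wraps an LLM and parses its final numeric answer. Because each call to `generate_rvq` draws a new variable combination, a model that has only memorized one instance of the problem would be expected to score poorly when evaluated over many samples.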

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling