VAR-MATH: Probing True Mathematical Reasoning in LLMs via Symbolic Multi-Instance Benchmarks
Jian Yao, Ran Cheng, and Kay Chen Tan

TL;DR
This paper introduces VAR-MATH, a symbolic multi-instance benchmark framework to evaluate genuine mathematical reasoning in large language models, revealing that current RL-trained models often overfit and lack true reasoning ability.
Contribution
The paper proposes VAR-MATH, a novel evaluation paradigm that converts fixed problems into parameterized templates to better assess reasoning and reduce data contamination in benchmarks.
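To make the idea of turning a fixed problem into a parameterized template concrete, here is a minimal sketch. It is not the authors' pipeline: the `SymbolicTemplate` class, the `make_instances` helper, the toy divisibility problem, and the integer parameter ranges are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SymbolicTemplate:
    """Hypothetical parameterized problem: text with placeholders plus an answer function."""
    text: str
    param_ranges: Dict[str, Tuple[int, int]]  # inclusive integer ranges (an assumption)
    answer_fn: Callable[..., int]             # computes the ground truth from the parameters

    def instantiate(self, rng: random.Random) -> Tuple[str, int]:
        # Sample one concrete value per parameter, then fill in the text and the answer.
        params = {name: rng.randint(lo, hi) for name, (lo, hi) in self.param_ranges.items()}
        return self.text.format(**params), self.answer_fn(**params)


def make_instances(template: SymbolicTemplate, k: int, seed: int = 0) -> List[Tuple[str, int]]:
    """Draw k (question, answer) pairs from one template; a fixed seed keeps runs reproducible."""
    rng = random.Random(seed)
    return [template.instantiate(rng) for _ in range(k)]


# Toy example (not taken from the benchmark): the structure is fixed, the numbers vary.
template = SymbolicTemplate(
    text="How many positive integers n <= {N} are divisible by {d}?",
    param_ranges={"N": (50, 200), "d": (2, 9)},
    answer_fn=lambda N, d: N // d,
)
instances = make_instances(template, k=5, seed=0)
for question, answer in instances:
    print(question, "->", answer)
```

Because the ground truth is computed from the sampled parameters, a memorized answer to one fixed instance does not carry over to the others. This sketch does not deduplicate instances or check that sampled parameters yield a well-posed problem, both of which a real benchmark would need to handle.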
Findings
RL-trained models show significant performance drops on VAR-MATH benchmarks.
Models rely on superficial heuristics rather than true reasoning.
The framework enhances robustness and reduces data leakage in evaluation.
Abstract
Recent advances in reinforcement learning (RL) have led to substantial improvements in the mathematical reasoning abilities of LLMs, as measured by standard benchmarks. Yet these gains often persist even when models are trained with flawed signals, such as random or inverted rewards. This raises a fundamental question: do such improvements reflect genuine reasoning, or are they merely artifacts of overfitting to benchmark-specific patterns? To answer this question, we adopt an evaluation-centric perspective and highlight two critical shortcomings in existing protocols. First, benchmark contamination arises because test problems are publicly available, thereby increasing the risk of data leakage. Second, evaluation fragility results from reliance on single-instance assessments, which are sensitive to stochastic outputs and fail to capture reasoning consistency. These limitations suggest…
Peer Reviews
Decision: Submitted to ICLR 2026
- **Right problem, right lens.** Moving from one-shot correctness to multi-instance consistency tests structural understanding.
- **Clear empirical signal.** The dataset is valuable and shows consistent, cross-model drops on variabilized sets, especially for small/medium RL-tuned models.
- **Timely benchmark.** The paper situates VAR-MATH among dynamic/functional-variation work (e.g., GSM-Symbolic, LiveBench, Putnam-AXIOM) and brings symbolic variation to today’s AIME-level tasks. I think thi…
[Critical] I worry that the cause of the decline is misattributed: the evidence might not support benchmark contamination as the primary driver. Below I state the alternative explanations that I worry might fit the data better (and how to remove these confounders):
- **Strict drop is an apples-to-oranges metric.** The paper compares AIME pass@1 against VAR-AIME 5/5 consistency. For fairness, compare strict VAR-AIME to a strict AIME defined as “correct only if all 5/5 sampling runs per problem…
- Clear motivation (contamination & fragility) and a sensible multi‑instance / consistency protocol.
- Broad empirical sweep across contemporary RL models with interpretable strict vs. loose metrics.
- Evidence that multi‑instance evaluation reduces variance and reveals failure modes hidden by single‑instance scoring.
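The strict and loose metrics mentioned in the reviews can be sketched as follows. The strict score follows the reviewers' description of 5/5 consistency (a problem counts only if every instance is answered correctly). The loose score is an assumption, taken here as the mean per-problem fraction of correct instances. The function names and the toy `results` data are hypothetical.

```python
from typing import Dict, List


def strict_score(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems whose instances were ALL answered correctly."""
    if not results:
        return 0.0
    return sum(all(r) for r in results.values()) / len(results)


def loose_score(results: Dict[str, List[bool]]) -> float:
    """Mean per-problem fraction of correct instances (assumed definition).

    Assumes every problem has at least one instance.
    """
    if not results:
        return 0.0
    return sum(sum(r) / len(r) for r in results.values()) / len(results)


# Toy data: four problems, five instances each (True = instance answered correctly).
results = {
    "p1": [True, True, True, True, True],
    "p2": [True, False, True, True, True],
    "p3": [False, False, False, False, False],
    "p4": [True, True, True, True, True],
}
print(f"strict = {strict_score(results):.2f}")  # strict = 0.50
print(f"loose  = {loose_score(results):.2f}")   # loose  = 0.70
```

In this toy data, problem `p2` has a single failed instance: it lowers the loose score only slightly but removes the problem entirely from the strict score. This is why a strict multi-instance score is not directly comparable to single-sample pass@1.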
The main weakness is that the paper fails to cite and contrast its work against previous work that does very similar explorations. For example, RE‑IMAGINE (ICML’25), neuro-symbolic data generation ([NeurIPS 2024](https://arxiv.org/abs/2412.04857)), and other symbolic benchmarking papers (GSM-Hard, GSM-Symbolic, GSM-IC, etc.) already (partly) introduced a symbolic representation → mutation → automatic ground‑truth pipeline, modes of difficulty, and reporting across math/code. Overlap is substanti…
1. Creation of a new dataset containing 430 question-answer pairs to tackle contamination and evaluation fragility.
2. In-depth evaluation of various models, both reasoning and non-reasoning models.
3. The authors employ a data processing method to convert each numerical problem into a symbolic template.
1. The authors discuss two issues with current benchmarks: contamination and evaluation fragility. I agree that because these datasets are publicly available, models can easily memorize them, which leads to contamination. However, the authors do not provide strong evidence that evaluation fragility is present in current benchmarks, especially in datasets like AIME and AMC.
2. The main idea of this dataset is to convert each problem into a symbolic template, which decouples problem structure f…
Taxonomy
Topics: Mathematics, Computing, and Information Processing
