Examining False Positives under Inference Scaling for Mathematical Reasoning

Yu Wang; Nan Yang; Liang Wang; Furu Wei; Fuli Feng

arXiv:2502.06217·cs.CL·September 19, 2025

TL;DR

This paper investigates the prevalence and impact of false positive solutions in mathematical reasoning by language models, revealing persistent issues across models and datasets that affect evaluation metrics and scaling behavior.

Contribution

It systematically analyzes false positives in mathematical problem solving, highlighting their persistence and influence on inference scaling and evaluation metrics.

Findings

1. False positives are common across models and datasets.

2. Sampling-based inference-time scaling does not reduce false positives.

3. The pass@N metric is more susceptible to false positives, which implies a lower scaling ceiling than automatic evaluation suggests.

Abstract

Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive…
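The distinction the abstract draws, between a final answer that matches the reference and a solution whose reasoning is also sound, can be made concrete with a small sketch. The code below is an illustration only and is not the paper's evaluation code: the `Sample` class, its two boolean labels, the `strict` flag, and the toy data are all hypothetical, and pass@N is computed here in its simplest form (at least one of the first N samples counts as correct), which may differ from the estimator the authors use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One sampled solution; both labels are hypothetical, for illustration."""
    answer_correct: bool     # final answer matches the reference (automatic check)
    reasoning_correct: bool  # deduction path judged sound (e.g. by manual review)


def pass_at_n(problems: Dict[str, List[Sample]], n: int, strict: bool) -> float:
    """Fraction of problems solved by at least one of the first n samples.

    strict=False accepts any sample whose final answer matches.
    strict=True additionally requires the reasoning to be sound.
    """
    solved = 0
    for samples in problems.values():
        for s in samples[:n]:
            if s.answer_correct and (s.reasoning_correct or not strict):
                solved += 1
                break
    return solved / len(problems)


def false_positive_rate(problems: Dict[str, List[Sample]]) -> float:
    """Share of answer-correct samples whose reasoning is flawed."""
    matched = [s for samples in problems.values() for s in samples if s.answer_correct]
    if not matched:
        return 0.0
    return sum(not s.reasoning_correct for s in matched) / len(matched)


# Toy data (invented): three problems, two samples each.
toy = {
    "p1": [Sample(True, True), Sample(False, False)],
    "p2": [Sample(False, False), Sample(True, False)],  # right answer, flawed path
    "p3": [Sample(False, False), Sample(False, False)],
}

auto_pass = pass_at_n(toy, n=2, strict=False)
strict_pass = pass_at_n(toy, n=2, strict=True)
fp_rate = false_positive_rate(toy)
```

On this toy data the answer-only pass@2 counts problem `p2` as solved while the strict variant does not, so the gap between the two numbers is exactly the contribution of false positives. With more samples per problem, an answer-only metric has more chances to pick up such a solution, which is one intuition for why a pass@N score could overstate what a model can actually derive.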

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning