Examining False Positives under Inference Scaling for Mathematical Reasoning
Yu Wang, Nan Yang, Liang Wang, Furu Wei, Fuli Feng

TL;DR
This paper investigates how often language models reach a correct final answer through flawed reasoning (false positives) in mathematical problem solving. It finds that the problem persists across models and datasets and distorts both evaluation metrics and inference scaling behavior.
Contribution
It systematically analyzes false positives in mathematical problem solving, highlighting their persistence and influence on inference scaling and evaluation metrics.
Findings
False positives are common across models and datasets.
Sampling-based inference does not reduce false positives.
The pass@N metric is especially susceptible to false positives, which implies a lower scaling ceiling than automatic evaluation suggests.
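To make the pass@N finding concrete, here is a minimal sketch of how false positives can inflate the metric. It assumes each sampled solution carries two labels: whether its final answer matches the reference, and whether its reasoning was judged valid. The `Sample` class, the function names, and the toy data are hypothetical illustrations, not the paper's code or results.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    answer_correct: bool     # final answer matches the reference
    reasoning_correct: bool  # hypothetical manual label for the deduction path

def pass_at_n(samples, require_valid_reasoning=False):
    """Return 1.0 if any sample counts as a solve, else 0.0."""
    for s in samples:
        if s.answer_correct and (s.reasoning_correct or not require_valid_reasoning):
            return 1.0
    return 0.0

def mean_pass_at_n(problems, require_valid_reasoning=False):
    """Average pass@N over a list of problems (each a list of N samples)."""
    return sum(pass_at_n(p, require_valid_reasoning) for p in problems) / len(problems)

# Toy data (made up): 4 problems, N = 2 samples each.
problems = [
    [Sample(True, True), Sample(False, False)],    # genuinely solved
    [Sample(True, False), Sample(False, False)],   # only a false positive
    [Sample(False, False), Sample(False, False)],  # unsolved
    [Sample(True, False), Sample(True, True)],     # false positive plus a real solve
]

automatic = mean_pass_at_n(problems)                               # answer-only check
verified = mean_pass_at_n(problems, require_valid_reasoning=True)  # reasoning also checked
print(automatic, verified)
```

In this toy setup, the second problem counts as solved under the answer-only check but not once reasoning is verified. Because pass@N needs only one accepted sample per problem, a single false positive among N samples is enough to credit the whole problem.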
Abstract
Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive…
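The answer-only evaluation that the abstract describes can be illustrated with a small sketch. The `Answer:` marker convention, the helper names, and the example solution are assumptions made for illustration; real benchmark graders use their own heuristics.

```python
import re
from fractions import Fraction

def extract_final_answer(solution: str) -> str:
    """Pull the token after 'Answer:' (a hypothetical output convention)."""
    match = re.search(r"Answer:\s*(\S+)", solution)
    return match.group(1) if match else ""

def answers_match(solution: str, reference: str) -> bool:
    """Compare only the final answer; the reasoning steps are never inspected."""
    predicted = extract_final_answer(solution)
    try:
        return Fraction(predicted) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        return False

# The deduction is invalid (digits cannot be "cancelled"), yet the answer is right.
flawed_solution = (
    "Simplify 16/64 by cancelling the 6 in the numerator and denominator. "
    "Answer: 1/4"
)
print(answers_match(flawed_solution, "1/4"))
```

The checker accepts this solution because it never looks at the steps, which is exactly how a false positive enters an automatically computed score.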
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
