Let's Verify Math Questions Step by Step

Chengyu Shen; Zhen Hao Wong; Runming He; Hao Liang; Meiyi Qiang; Zimo Meng; Zhengyang Zhao; Bohan Zeng; Zhengzhou Zhu; Bin Cui; Wentao Zhang

arXiv:2505.13903 · cs.CL · March 13, 2026

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper introduces ValiMath, a benchmark of 2147 human-verified math questions, and MathQ-Verify, a pipeline that parses math questions and verifies their correctness to improve dataset quality for LLM training.

Contribution

The paper presents ValiMath as a new benchmark for question correctness and introduces MathQ-Verify, a novel method for fine-grained parsing and semantic verification of math questions.

Findings

01. ValiMath provides a reliable gold-standard dataset for math question evaluation.

02. MathQ-Verify achieves state-of-the-art accuracy in question verification tasks.

03. The pipeline significantly reduces noise in mathematical datasets, enhancing LLM training quality.

Abstract

Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math question-answer (QA) data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the correctness of the questions themselves. In this work, we present ValiMath, a benchmark consisting of 2147 human-verified mathematical questions covering a wide range of domains such as arithmetic, algebra, and geometry, which are synthesized and curated from the NuminaMath dataset. Each question is annotated with its logical structure, domain coverage, and question correctness, enabling fine-grained evaluation of question quality. ValiMath serves as a high-quality…
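As a rough illustration of how a benchmark with per-question correctness labels could be used, the sketch below scores a question verifier against gold labels. The record layout (`QuestionRecord` and its fields `question`, `domain`, `is_valid`) and the function `score_verifier` are hypothetical names chosen for this example; they are not taken from the ValiMath dataset or the MathQ-Verify code, and the metrics shown are a generic choice rather than the paper's stated protocol.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QuestionRecord:
    # Hypothetical layout; the actual ValiMath schema may differ.
    question: str
    domain: str      # e.g. "algebra"
    is_valid: bool   # gold label: is the question itself well-posed?


def score_verifier(records: List[QuestionRecord],
                   predictions: List[bool]) -> Dict[str, float]:
    """Precision, recall, and F1 for the 'valid question' class."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have equal length")
    pairs = list(zip(records, predictions))
    tp = sum(1 for r, p in pairs if r.is_valid and p)
    fp = sum(1 for r, p in pairs if not r.is_valid and p)
    fn = sum(1 for r, p in pairs if r.is_valid and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data invented for this example, not drawn from the benchmark.
toy_records = [
    QuestionRecord("Solve x + 2 = 5.", "algebra", True),
    QuestionRecord("Find the area of a square with side 3.", "geometry", True),
    QuestionRecord("Find the real x with x^2 = -1.", "algebra", False),
    QuestionRecord("A triangle has sides 1, 1, 5. Find its area.", "geometry", False),
]
toy_predictions = [True, False, True, False]
toy_scores = score_verifier(toy_records, toy_predictions)
```

Scoring the "valid" class separately from overall accuracy matters when valid and flawed questions are imbalanced, since a verifier that accepts everything can still look accurate.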

Code & Models

Repositories

scuuy/mathq-verify
Official

Datasets

scuuy666/ValiMath
dataset · 16 downloads

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Topic Modeling · Natural Language Processing Techniques