Pessimistic Verification for Open Ended Math Questions
Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, Yang Liu

TL;DR
This paper introduces pessimistic verification, a simple method that runs multiple parallel checks on the same proof and deems the proof incorrect if any check reports an error. It improves the verification of open-ended math questions without substantial computational cost.
Contribution
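The paper proposes pessimistic verification, which improves math verification by running parallel checks on the same proof and rejecting it if any check reports an error. Its case studies also attribute most false negatives from stronger models to annotation errors in the original dataset.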
Findings
Significantly improves verification accuracy on math benchmarks
Token efficiency surpasses extended long-CoT in test-time scaling
Most false negatives in stronger models trace to annotation errors in the original dataset, so the method's performance is underestimated
Abstract
The key limitation of verification performance lies in the ability to detect errors. With this intuition, we designed several variants of pessimistic verification, simple workflows that can significantly improve the verification of open-ended math questions. In pessimistic verification, we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves performance across many math verification benchmarks without incurring substantial computational cost. Its token efficiency even surpasses extended long-CoT in test-time scaling. Our case studies further indicate that the majority of false negatives in stronger models are actually caused by annotation errors in the original dataset, so our method's performance is in fact underestimated.…
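The aggregation rule described in the abstract (run several verifications of the same proof in parallel, and reject the proof if any one of them reports an error) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `pessimistic_verify`, the `Verifier` callable type, and the use of a thread pool are all hypothetical, and each verifier stands in for whatever model call a real system would make.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# Hypothetical interface: a verifier takes (problem, proof) and returns
# True if it found no error, False if it reports an error.
Verifier = Callable[[str, str], bool]


def pessimistic_verify(problem: str, proof: str, verifiers: Sequence[Verifier]) -> bool:
    """Accept the proof only if every parallel verification finds no error."""
    if not verifiers:
        raise ValueError("at least one verifier is required")
    # Run the checks concurrently; in practice each would be an independent
    # model call (assumption), here they are plain Python callables.
    with ThreadPoolExecutor(max_workers=len(verifiers)) as pool:
        verdicts = list(pool.map(lambda check: check(problem, proof), verifiers))
    # Pessimistic aggregation: a single reported error rejects the proof.
    return all(verdicts)
```

With this rule, adding more checks can only make acceptance stricter: a proof that passes must survive every verification, which is what targets the error-detection ability the abstract identifies as the bottleneck.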
Taxonomy
Topics: Mathematics, Computing, and Information Processing · Model Reduction and Neural Networks · Topic Modeling
