Do We Need Frontier Models to Verify Mathematical Proofs?
Aaditya Naik, Guruprerana Shabadi, Rajeev Alur, Mayur Naik

TL;DR
This paper evaluates the proof verification capabilities of various language models, demonstrating that smaller models can match frontier models' performance with specialized prompts, improving reliability and accuracy.
Contribution
It reveals that smaller open-source models can verify proofs as effectively as frontier models when guided by optimized prompts, challenging assumptions about model size requirements.
Findings
Smaller models are up to 10% less accurate than frontier models.
Specialized prompts improve smaller models' accuracy by up to 9.1%.
Ensemble prompts enable smaller models to match frontier model performance.
Abstract
Advances in training, post-training, and inference-time methods have enabled frontier reasoning models to win gold medals in math competitions and settle challenging open problems. Gaining trust in the responses of these models requires that natural language proofs be checked for errors. LLM judges are increasingly being adopted to meet the growing demand for evaluating such proofs. While verification is considered easier than generation, what model capability does reliable verification actually require? We systematically evaluate four open-source and two frontier LLMs on datasets of human-graded natural language proofs of competition-level problems. We consider two key metrics: verifier accuracy and self-consistency (the rate of agreement across repeated judgments on the same proof). We observe that smaller open-source models are only up to ~10% behind frontier models in accuracy but…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
