Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev

TL;DR
This paper introduces ProofRank, a benchmark for evaluating LLM-generated mathematical proofs beyond correctness, focusing on qualities like conciseness, simplicity, diversity, and adaptivity.
Contribution
The work defines concrete proof quality components, creates the ProofRank benchmark, and demonstrates that proof quality varies independently of correctness in LLMs.
Findings
ProofRank reveals substantial differences in proof quality across models.
Proof quality metrics often trade off with correctness.
Current benchmarks overlook important aspects of proof usefulness.
Abstract
Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
