TL;DR
ProofGrid is a comprehensive benchmark suite that evaluates large language models' reasoning abilities through machine-checkable proofs in minimal formal notation, providing fine-grained, reproducible assessments.
Contribution
The paper introduces ProofGrid, a novel benchmark with formal proofs in NDL, enabling precise evaluation of LLM reasoning beyond final answers, and develops a new proof-checking pipeline.
Findings
Models perform well on foundational tasks but struggle with complex reasoning.
ProofGrid reveals epistemic instability where models generate flawed proofs but reject incorrect inferences.
The evaluation framework includes advanced analysis methods like IRT and Fisher information.
Abstract
We introduce ProofGrid, a benchmark suite for evaluating LLM reasoning through machine-checkable proofs rather than final answers alone. ProofGrid contains 15 tasks spanning proof writing, proof checking, proof masking, and proof gap-filling. Tasks are expressed in minimal formal notation, especially NDL, a compact natural-deduction language that fits in short prompts and supports precise, auditable verification. This yields mechanical, reproducible, and fine-grained evaluation rather than judgments by humans or LLMs. ProofGrid covers a calibrated difficulty spectrum, from foundational reasoning tests to structurally rich challenge tasks that no current model solves, while minimizing reliance on domain knowledge, solver delegation, and long-context artifacts. We also develop a comparative framework for reasoning benchmarks and use it to situate ProofGrid relative to existing work in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
