Benchmarking Testing in Automated Theorem Proving
Jongyoon Kim, Hojae Han, Seung-won Hwang

TL;DR
This paper introduces a new semantic testing framework for formal theorem proving, using compilation success of dependent theorems as a correctness measure, and benchmarks models on real-world Lean 4 repositories.
Contribution
It proposes T, a test-based evaluation framework for semantic correctness in theorem proving, and constructs a benchmark dataset from real-world repositories.
Findings
State-of-the-art models achieve high compilation success but low semantic correctness.
The best model achieves only 38.9% accuracy under the new semantic metric.
Current models have a significant gap in generating semantically correct theorems.
Abstract
Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
