Benchmarking Testing in Automated Theorem Proving

Jongyoon Kim; Hojae Han; Seung-won Hwang

arXiv:2604.23698·cs.CL·April 29, 2026

Benchmarking Testing in Automated Theorem Proving

Jongyoon Kim, Hojae Han, Seung-won Hwang

PDF

TL;DR

This paper introduces a new semantic testing framework for formal theorem proving, using compilation success of dependent theorems as a correctness measure, and benchmarks models on real-world Lean 4 repositories.

Contribution

It proposes T, a test-based evaluation framework for semantic correctness in theorem proving, and constructs a benchmark dataset from real-world repositories.

Findings

01

State-of-the-art models achieve high compilation success but low semantic correctness.

02

The best model achieves only 38.9% accuracy under the new semantic metric.

03

Current models have a significant gap in generating semantically correct theorems.

Abstract

Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.