Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

Chengwu Liu; Yichun Yin; Ye Yuan; Jiaxuan Xie; Botao Li; Siqi Li; Jianhao Shen; Yan Xu; Lifeng Shang; Ming Zhang

arXiv:2604.15839 · cs.AI · April 20, 2026

TL;DR

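This paper introduces Hard Mode benchmark variants for automated theorem proving in Lean 4, in which the system must discover the answer before proving it. It also presents DAP, an agentic framework that uses LLMs for answer discovery and passes the rewritten statements to existing provers, raising the number of solved CombiBench problems from 7 to 10.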

Contribution

It provides the first Hard Mode benchmark variants (MiniF2F-Hard and FIMO-Hard) and a novel framework, DAP, that improves theorem-proving success rates by combining LLM natural-language reasoning with explicit self-reflection.

Findings

01. DAP raises the number of solved CombiBench problems from 7 to 10.

02. DAP is the first to formally prove 36 theorems in Hard Mode on PutnamBench.

03. LLMs achieve over 80% answer accuracy, surpassing formal provers.

Abstract

Most automated theorem proving (ATP) benchmarks embed the final answer within the formal statement, a convention we call "Easy Mode". This design simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA,…
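The discover-then-rewrite flow described in the abstract can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in and not taken from the paper: the `ANSWER` placeholder convention, the toy Lean statement, and the `stub_discover` and `stub_prove` callables, which take the place of the LLM reasoning step and an existing Easy Mode prover.

```python
from typing import Callable, Optional, Tuple

# Hypothetical Hard Mode statement: the answer is a placeholder, not a value.
# The ANSWER token is an illustrative convention, not the paper's format.
HARD_STATEMENT = "theorem find_x (x : Nat) (h : x + 3 = 5) : x = ANSWER := by sorry"


def rewrite_to_easy_mode(hard_statement: str, answer: str) -> str:
    """Substitute a discovered answer for the placeholder (Hard Mode -> Easy Mode)."""
    if "ANSWER" not in hard_statement:
        raise ValueError("statement has no ANSWER placeholder")
    return hard_statement.replace("ANSWER", answer)


def discover_and_prove(
    hard_statement: str,
    discover: Callable[[str], str],
    prove: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Discover an answer, rewrite the statement, then call an Easy Mode prover.

    Returns (answer, proof) on success, or None if the prover fails.
    """
    answer = discover(hard_statement)
    easy_statement = rewrite_to_easy_mode(hard_statement, answer)
    proof = prove(easy_statement)
    return (answer, proof) if proof is not None else None


def stub_discover(statement: str) -> str:
    # Stand-in for LLM natural-language reasoning with self-reflection.
    return "2"


def stub_prove(easy_statement: str) -> Optional[str]:
    # Stand-in for an existing prover; accepts only the correctly filled statement.
    return "by omega" if ": x = 2 :=" in easy_statement else None


result = discover_and_prove(HARD_STATEMENT, stub_discover, stub_prove)
```

Passing the discovery and proving steps in as callables keeps the sketch independent of any particular model or prover; a wrong answer from `discover` simply yields a statement the prover cannot close.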
