miniCTX: Neural Theorem Proving with (Long-)Contexts
Jiewen Hu, Thomas Zhu, Sean Welleck

TL;DR
miniCTX is a new benchmark for neural theorem proving that emphasizes the importance of long, real-world contexts, testing models' ability to utilize extensive information not seen during training.
Contribution
The paper introduces miniCTX, a challenging benchmark with real-world contexts for neural theorem proving, and provides tools for automatic data extraction and annotation.
Findings
Fine-tuning and prompting outperform traditional methods.
Models effectively utilize context when trained on miniCTX.
miniCTX reveals limitations of previous benchmarks like miniF2F.
Abstract
Real-world formal theorem proving often depends on a wealth of context, including definitions, lemmas, comments, file structure, and other information. We introduce miniCTX, which tests a model's ability to prove formal mathematical theorems that depend on new context that is not seen during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem's repository, which contains context that is needed for the proof. As a baseline for miniCTX, we tested fine-tuning and prompting methods that condition theorem proving on preceding context. Both approaches substantially outperform traditional methods that rely solely on state information. We found that this ability to use context is not captured by previous…
Peer Reviews
Decision·ICLR 2025 Oral
1. The $\texttt{miniCTX}$ benchmark fills the gap in the current community: it expands current benchmarks by addressing the limitations of standalone theorem proving and enabling evaluation in real-world scenarios where context is critical. 2. I appreciate the authors' commitment to automatically updating the benchmark and maintaining a temporal split to mitigate the data contamination, given that most LLMs nowadays crawl GitHub and train on it. 3. The authors present details for constructing
1. $\texttt{miniCTX}$ does not have a valid/test set separation. Though it's not inherently an issue for a benchmark, separating a valid set could make the benchmark less gameable under Goodhart's law. 2. It seems that some problems in $\texttt{miniCTX}$ are with contexts that could be easily "in-lined" and transformed into context-less problems. For example, the example shown in Appendix A.1: one could easily in-line the square function $s$ definition into the lemma s\_eq\_pow\_two and make th
- The paper introduces a useful tool (NTP-toolkit) and benchmark (miniCTX) for evaluating a model's theorem proving capabilities. The Python Lean REPL, provided that it is performant, is another useful component that will enable researchers to more easily evaluate models. - A benchmark of context-dependent theorem proving and a model that performs well on it could potentially be useful for assisting humans using Lean for formal verification. - The idea of file tuning, i.e., training model(tactic
- The differences between benchmarks shown in Table 1 is, in my perspective, superficial. For example, what fundamental reason is there from preventing other theorem-proving extraction tools from extracting a timestamp or saving the file name that a theorem is extracted from? - There seem to be a few missing experiments. First, since the temporal split is emphasized, is there any empirical evidence to support the failure of other benchmarks to handle this properly? Second, for file-tuning, have
* This work provides a new dataset for the theorem proving, which tests the model's ability to prove formal mathematical theorems that depend on new context that is not seen during training. And the ability to use context is not captured by previous benchmarks such as miniF2F. * The work provides baseline results on miniCTX, telling the difficulty of the proposed miniCTX. * The work provides detailed analysis on the experiments. * The work provides details of the miniCTX dataset and samples in t
* For the Table 3, how do you test the File tuning Model result on miniF2F, as the miniF2F only has the formal statement? You may: 1) only give the formal statement for File tuning Model and get the result; 2) Or, you provide addtional context information by any way?; * For Table 3, you do not report the result of GPT-4o (full proof), could you explain why? Previous works (DSP, LEGO-Prover or Lyra) all reports the results GPT-4 on miniF2F.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNeural Networks and Applications · Explainable Artificial Intelligence (XAI)
