TL;DR
This paper introduces a reinforcement learning guidance system for automated theorem proving that effectively generalizes from minimal data to find longer proofs, outperforming traditional methods on complex benchmarks.
Contribution
The work presents a novel RL-based approach that generalizes from limited training data to produce longer proofs, demonstrating competitive performance on challenging datasets.
Findings
FLoP successfully generalizes from a single proof to related problems.
FLoP can find longer proofs than existing systems on benchmark datasets.
The approach requires limited search due to its effective guidance.
Abstract
We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). Unlike most learning based approaches, we focus on generalising from very little training data and achieving near complete confidence. We use several simple, structured datasets with very long proofs to show that FLoP can successfully generalise a single training proof to a large class of related problems. On these benchmarks, FLoP is competitive with strong theorem provers despite using very limited search, due to its ability to solve problems that are prohibitively long for other systems.
| Name | Theory | Size | Description |
| RA-1 | RA | 1800 | Expressions of the form , , where (Examples: or .) |
| RA-2 | RA | 1000 | , where , and is a random expression with operators and operands such that (E.g.: .) |
| RA-3 | RA | 1000 | , where and are random expressions with operators and operands such that E.g. .) |
| LCL-Eq | LCL | 890 | TPTP domain: Logic Calculi (Equivalential) – extended with lemmata from E prover. |
| LCL-Imp | LCL | 1204 | TPTP domain: Logic Calculi (Implication/Falsehood 2 valued sentential) – extended with lemmata from E prover. |
| Dataset | Random | Vampire | E1 | E2 | E3 | leanCoP | rlCoP | FLoP |
| RA-1 | 1.0 | 1.0 | ||||||
| RA-2 | 1.0 | |||||||
| RA-3 | 1.0 |
| Training problem | Succ. | Len. |
| 0.67 |
| Dataset | Vampire | E | leanCoP | rlCoP | FLoP |
| RA-1 | 1.0 | ||||
| RA-2 | 1.0 |
| Prover | Eval | LCL-Eq | LCL-Imp | RA-1 | RA-2 |
| plCoP | MCTS | 47% | 61% | 65% | 48% |
| plCoP | Eager | 5% | 5% | 82% | 49% |
| FLoP | MCTS | 19% | 24% | 61% | 31% |
| FLoP | Eager | 19% | 27% | 100% | 99% |
| Dataset | Curriculum | No curriculum |
| LCL-Eq | 0.24 (0) | 0.23 (0.001) |
| LCL-Imp | 0.51 (0.002) | 0.45 (0.003) |
| Data | Proof | Supervised | Curriculum |
| Lengths | Succ. | Succ. | |
| RA-1 | , | 1(0.01) | |
| , | 0.98(0.01) | ||
| RA-2 | , , | 0.85(0.04) | |
| , , | 0.76(0.01) |
| Name | Axiom |
| zero successor | |
| different successors | |
| plus zero | |
| plus successor | |
| mul zero | |
| mul successor | |
| equality reflexivity | |
| equality symmetry | |
| equality transitivity | |
| congruence of s | |
| congruence of plus | |
| congruence of mul |
| Name | Axiom |
| zero successor | |
| one successor | |
| different successors | |
| predecessor | |
| plus zero | |
| plus one1 | |
| plus one2 | |
| plus one3 | |
| plus one3 | |
| plus more1 | |
| plus more2 | |
| plus more3 | |
| mul zero1 | |
| mul zero2 | |
| mul one1 | |
| mul one2 | |
| mul more | |
| equality reflexivity | |
| equality symmetry | |
| equality transitivity | |
| congruence of b | |
| congruence of plus | |
| congruence of mul |
| Dataset | Prover | MCTS | Eager Policy | Eager Value |
| LCL-Eq | plCoP | 47% | 5% | 1% |
| LCL-Imp | plCoP | 61% | 5% | 1% |
| RA-1 | plCoP | 65% | 82% | 3% |
| RA-2 | plCoP | 48% | 49% | 5% |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
11institutetext: Alfréd Rényi Institute of Mathematics, Budapest 22institutetext: Eötvös Loránd University, Budapest 33institutetext: University of Warsaw 44institutetext: Google Inc. 55institutetext: University of Innsbruck 66institutetext: Czech Technical University in Prague
Towards Finding Longer Proofs
Zsolt Zombori 1122
Adrián Csiszárik 1122
Henryk Michalewski 3344
Cezary Kaliszyk 5533
Josef Urban 66
Abstract
We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). Unlike most learning based approaches, we focus on generalising from very little training data and achieving near complete confidence. We use several simple, structured datasets with very long proofs to show that FLoP can successfully generalise a single training proof to a large class of related problems. On these benchmarks, FLoP is competitive with strong theorem provers despite using very limited search, due to its ability to solve problems that are prohibitively long for other systems.
Keywords:
automated theorem proving machine learning reinforcement learning connection calculus
1 Introduction
Automated Theorem Proving (ATP) is the study of using machines for formal mathematical reasoning. It is related to general game playing, for example, the game of Go can be viewed as a simple formal system. Building on the recent success of machine learning, a growing trend in this field is to use learning methods to make theorem provers more powerful. Several research projects have shown that learning can be used to replace/surpass human-engineered heuristics. Despite huge improvements, interesting mathematical theorems remain elusive today. One crucial shortcoming of ATP systems is that they can typically find only relatively short proofs.
In this paper, we address this shortcoming and ask the question of how machine learning can be used to solve problems requiring very long inference chains. We argue that the fundamental reason why current ATP systems are limited to short proofs is that they focus on the search aspect of the task in the space of inference steps. It is very natural to see theorem proving as a search problem: each proof step involves a choice from a set of valid inferences, yielding a search space that grows exponentially with the length of the proof. Due to the exponential blowup, the search is bound to fail beyond a certain depth – except for special classes of problems where one of the smart human heuristics of the theorem prover allows for finding the solution without a search. As W. W. Bledsoe observed [8]: “Automated theorem proving …is not the beautiful process we know as mathematics. This is ‘cover your eyes with blinders and hunt through a cornfield for a diamond-shaped grain of corn’.”
Approaches that try to avoid excessive search broadly fall into three categories: 1) Perform large steps, such as the invocation of tactics, decision procedures in SMT solvers [5], or other complex algorithms. This approach is widely used in interactive theorem provers, e.g. [14, 6, 31]. 2) Perform hierarchical reasoning by first creating a high-level proof plan and then gradually refine it to the calculus level, e.g. [11, 30]. 3) Reason by analogy, e.g. [29, 9].
Reasoning by analogy involves observing the proof of one problem, extracting the core idea, and successfully applying it to another. Note that using this formulation, success is barely dependent on proof length. On the other hand, establishing mappings between proofs is challenging and depends heavily on a proper data representation, which has been from the beginnings of ATP a major bottleneck for this approach. However, with the advent of machine learning methods capable of automatically discovering good data embeddings, the analogy approach seems worth revisiting.
In this work, we interpret analogical reasoning as building a model that internalises a proof and then successfully applies it to a class of related problems, without relying much on search. The trained model is supposed to "know" the proof of an unseen, yet familiar problem. This is a highly simplified approach which does not capture the full potential of analogy, but we argue that it is a meaningful start that will hopefully lead to more refined solutions. We select classes of problems where proofs are highly similar and trained humans can often generalize a single demonstration to the entire class, even if proof lengths greatly differ. We explore whether machine learning can yield similar generalization.
Many successful ATP systems, such as [53, 18, 12, 3, 20, 57, 32, 37] implement the MaLARea [51, 53] learning/reasoning loop (described later also as the DAgger [44] meta-algorithm). The MaLARea loop interleaves ATP runs based on the current models (data collection phase) with a training phase, in which these models are updated to fit the collected data.
An alternative family of reinforcement learning methods, including Temporal Difference (TD) learning [50], continuously update their models, allowing the system to bootstrap on itself. Such methods have so far been mostly ignored by the theorem proving community. In these methods, the search is usually replaced by rollouts. While MaLARea has been shown to yield good search heuristics, we argue that rollout based data collection is more suitable when the aim is to fully explore the space around a single problem without overfitting to it. Our work has the following contributions.
- •
We introduce a new theorem proving algorithm FLoP (Section 3) based on a TD algorithm 111In particular, we use Proximal Policy Optimization [46] (PPO), a variant of the policy gradient method, which uses Temporal Difference learning for optimization of the value function. and the connection tableau calculus [7]. FLoP makes use of the curriculum learning algorithms presented by [41] and [45]. These techniques are well established in RL, however, they have never been applied to theorem proving before.
- •
We introduce a synthetic dataset of increasingly difficult arithmetic problems, as well as two datasets from the Logical Calculi domain of the TPTP [49] library, augmented with lemmata (Section 4).
- •
We show that when restricted to single shot evaluation – without search – FLoP performs very well, while another prover based on guided Monte Carlo Tree Search greatly degrades.
- •
We evaluate FLoP on our arithmetic benchmarks by training it on a single problem and show that it generalizes very well even when evaluated without search, allowing just a few proof attempts. This suggests that it has learned a simple form of reasoning by analogy.
- •
We use the arithmetic benchmarks to compare FLoP with state-of-the-art provers Vampire [26], E [47], leanCoP [33] guided by human-designed strategies, and with rlCoP [20] – an RL-based connection tableau prover. In the simple setup of unary encoding of numbers, FLoP is only outperformed by a portfolio (multi-strategy) mode of a single manually optimized rewriting-based system and only after trying several of its autoconfiguration heuristics. When using binary encoding, FLoP performs best, demonstrating its ability to generalize to long proofs.
Our datasets presented in Section 4 seem to be particularly suited for machine learning methods: some problems are algorithmically simple, with long solutions and strong shared structure (Robinson Arithmetic) while others are less similar, but hierarchically structured (Logical Calculi). Nevertheless, state-of-the-art systems struggle with solving some of the problems (see Section 5). Furthermore, our problems are much easier to analyze than typical heterogeneous proof corpora, hence promising better understanding of the current limits of theorem provers. The difficulty of our synthetic problems, as well as the proof lengths, are easily adjustable, yielding a scalable RL benchmark with interpretable failure modes (see Appendix 0.E).
Our code, datasets and all experiment configuration files are available at http://bit.ly/code_atpcurr222This distribution does not include the fCoP theorem prover, which cannot yet be publicly released, however, a binary can be obtained upon request.. Supplementary materials including screencasts with gameplays performed in our environments are available at the project webpage http://bit.ly/site_atpcurr.
2 Related work
Theorem Proving by Analogy.
Analogy has long been considered one of the most important heuristics in mathematical problem solving, e.g. [39, 38]. It also gained attention in automated theorem proving, e.g.[9, 29], as an alternative of search-based methods. [9] defines analogical reasoning as “the proof of one theorem is used to guide the proof of a similar theorem by suggesting analogous steps”. They rely on a user-provided matching between analogous concepts related to the two theorems and try to reuse the proof steps (adjusted modulo analogy) in the source proof during the construction of the target. [29] aim to achieve this on a higher level of abstraction by matching proof plans of a source and a target problem. As the proof plan of the target problem is constructed, the plan of the source is searched for steps that can be transformed into a suitable step for the target. The set of allowable transformations are predefined and designed for a narrow domain. For example, the transformations given in [29] aim to carry a result, such as the Heine Borel theorem, stated in over to . The characteristic feature of these systems is that search is performed on the meta level of plan mappings and proof step transformations. The search space is often defined ad hoc and is much smaller than that given by the inference rules of the calculus.
A machine learning system that is trained to guide a theorem prover is supposed to achieve a similar result, with two important improvements. First transformations are learned, without the need for manual engineering. Second, establishing mappings between proof steps (that can be transformed into each other) should result from learning of flexible and abstract features. The flexibility and abstraction allows for potentially reusing the same proof components several times, as well as using components from different proofs, which goes beyond earlier attempts that only establish direct matching between the two proofs.
Machine learning systems for guiding theorem provers.
A large body of research exists that aims to provide guidance for theorem provers via machine learning. FEMaLeCoP [19], rlCoP [20, 32] plCoP [57] and lazyCoP [40] guide the leanCoP [33] compact connection tableau prover, which is also the system guided in our project. Learning based guidance is added to the saturation based E prover [47] in [27, 17, 12, 16]. The HOList project [4, 34, 3] builds guidance on the tactic level333A tactic is a human-designed program which aggregates multiple proof steps. for the HOL Light [14] higher-order theorem prover. A distinctive feature of all these systems is that they rely heavily on an external search procedure, such as Monte Carlo Tree Search [25], or the search engine of the guided prover. Learning is aimed at making search more efficient and it is implemented in alternating iterations of proof search and model fitting, according to the DAgger [44] meta-algorithm, first used in MaLARea [53] for theorem proving. In contrast with the above, we use an algorithm which uses bootstrapping and learns from generated rollouts, aiming to learn to generate entire proof sequences in the leanCoP calculus. Such rollout based learning has so far been barely used in theorem proving, with the noteable exception of [13], developed in parallel with FLoP, which guides a saturation style prover using a simple policy gradient RL algorithm.
Concurrently with our work, [36, 52, 37] have used recurrent neural networks, attention and transformers to generate next proof steps. E.g., [37] report generalisation on problems with relatively short proofs. In line with emphasizing analogy over search, their evaluation protocol only allows for very limited search in a single proof attempt 444A maximum of 4096 search nodes are allowed.. Our work employs much smaller neural models and focuses on generalizing to proofs with hundreds and thousands of steps (see Figure 3).
Provers guiding the leanCoP Connection Tableau Calculus.
As noted above, a series of learning systems guide the leanCoP connection calculus. Of these, we highlight three systems that use roughly the same learning setup: rlCoP [20], plCoP [57] and graphCoP [32]. In these systems, the value and policy functions of the guided MCTS algorithm are learned similarly to [1, 48]. FLoP shares the same manually developed features [22] with rlCoP and plCoP, while graphCoP employs a graph neural network for feature extraction. We use these systems as an important baseline in Section 5. While the differences are important, they play little role in our current investigation and we refer to them jointly as mcts-CoPs.
3 FLoP – Main Algorithm
FLoP combines the connection tableau calculus with guidance based on Temporal Difference and curriculum learning. A brief introduction to the connection tableau calculus is provided in Appendix 0.F. After each inference step, the prover engine returns its current state as well as the set of valid actions, i.e., valid inference steps that transform the current goal. The prover is encapsulated into a Reinforcement Learning (RL) environment. In the following, we provide a brief summary of the relevant RL techniques.
3.1 Reinforcement Learning fundamentals
The RL introduction below is highly selective, aiming to describe Proximal Policy Optimization [46], the method used in FLoP. For further details, we recommend [50].
Markov Decision Process
The mathematical foundation of the class of problems that Reinforcement Learning aims to solve is given by Markov Decision Processes (MDP). An describes a dynamic process and consists of the following components: is the set of states, is the set of actions, is a reward function, is the state transition function and is the discount factor. We assume that an agent interacts with this MDP, generating sequences of state-action-reward tuples, called trajectories. The agent is equipped with a policy function which determines which action it selects in a particular state. The aim of the agent is to maximize its total accumulated reward . Several components of the model can be stochastic: the reward function, the transition function, as well as the policy. In such settings, the aim of the agent it to find the policy that maximizes its cumulative expected reward, where future rewards are discounted with the discount factor:
[TABLE]
Policy Gradient
One successful family of methods solves this task by considering a parametric class of policy functions . We continuously sample trajectories from the current policy and optimize the parameters via gradient descent based on the observed rewards. This is called policy gradient, and the RL literature contains numerous variants that differ in the details of optimization.
One well known difficulty of policy gradient is the large variance in the sampled trajectories, which makes convergence slow and requiring large number of training samples. A popular technique to reduce variance is to train a baseline model that esimates the expected reward from a given state and optimize the policy with respect to the excess reward on top of the baseline. This gives rise to the actor-critic framework. We train two models jointly: a critic: that estimates the expected reward of trajectories starting from given policy and an actor, which is our policy . Given some state , we use the policy to sample an action . We then sample further transitions to estimate the expected reward from state after taking action . We define advantage as the difference between these two expectations: . Our optimization objective is then:
[TABLE]
where and are the parameters of the critic and the actor, respectively.
Proximal Policy Optimization
Policy gradient is an on-policy method, meaning that it optimizes the parameters of the policy based on trajectories sampled from the same policy. In contrast with off-policy methods, which extract samples through some other mechanism, policy gradient learning can be highly unstable. This is because the change in policy potentially invalidates the samples it was trained on. Proximal Policy Optimization [46] (PPO) addresses this problem by introducing a soft constraint on the magnitude of the policy updates.
We maintain two instances of the policy network: which we aim to improve and which we sample from. The ratio of the two policies gives us a measure of difference:
[TABLE]
If this ratio lies outside of the range , then the advantage function is clipped:
[TABLE]
[TABLE]
The two neworks are periodically synchronized to ensure that they are not too different. PPO has been shown to strike a good balance between simplicity and stability and is one of the most popular policy gradient methods.
3.2 Reinforcement Learning in FLoP
Theorem proving can be directly mapped into an MDP by treating prover states as states, inference steps as actions and proof attempts as trajectories. The only missing component is the reward function, which we set to be
[TABLE]
Other reward functions are also possible, though we argue that the selected one is most faithful to the task at hand: 1) we know very little about progress before we have found a proof, hence the zero reward for intermediary steps and 2) it is hard to tell if one proof is better than another, hence the binary nature of the rewards.
Reward maximization directly corresponds to finding a proof. Hence, we augment the core connection tableau calculus with a value (critic) and a policy (actor) model trained using PPO. Classical proof search is then replaced with generating proof attempts from the policy. Figure 1 shows the overall architecture of the system and Figure 2 shows the policy and value network architectures. The state and the actions (formulae) are represented using previously developed features [22]. The features include (suitably hashed) triples, pairs, and singletons of adjacent nodes in the formula trees and the partial proof trees, as well as some global features: number of open goals, number of symbols in them, their maximum size and depth, length of the current path, and two most frequent symbols in open goals. This means that the proof states and the actions are presented as (sparse) fixed-length vectors.
3.3 Curriculum Learning
A fundamental challenge for an RL system that learns to prove theorems from its own exploration is that rewards are sparse and binary. In case proofs are long, this makes learning nearly impossible. To tackle this, we use curriculum learning on the length of proofs in case a proof is available. Initially, we start exploration from near the end of the proof, making it easy to succeed and obtain positive reward. As the system gets more confident, we gradually move the starting state backwards along the given proof. This approach has already been successfully applied in many RL experiments. When there is no good alternative to the training proof, the system eventually learns those steps, while random exploration helps to identify alternatives and find novel proofs. Exploration also helps to learn steps that make the proof impossible to finish. We can start learning with or without training proofs. Each training problem can have its own curriculum schedule, which can be restarted when a new proof is found. Curriculum learning is an efficient tool for boosting rewards found during exploration.
3.4 Training algorithm
Algorithm 1 gives an overview of the learning loop. First, in line 5 we sample a problem (in case there are multiple). In lines 6–9 we interact with the prover and ensure that its state corresponds to the one dictated by the current curriculum. In lines 10–15 we generate a rollout (proof attempt) iterating 1) prover steps, 2) featurization, and 3) sampling a next action according to the policy. If a new problem is solved, we start the curriculum on it in lines 17–19. If performance on a given problem and curriculum reaches a threshold, we advance the curriculum in lines 20–22. In line 24 we update the policy and value models based on the specified number of episodes.
3.5 Implementation details
Most of the FLoP system is implemented in the Python programming language, using the [15] RL framework. FLoP guides the fCoP [21] system, which is a reimplementation of leanCoP in the OCaml programming language. The communication between the guidance and prover components is provided via the C foreign language interface.
4 Datasets
To evaluate our system, we select simple classes of theorems with strong shared structure, giving a large room for learning-based improvement. Our five datasets are described in Table 1. The datasets are bundled into an OpenAI-gym [10] compliant environment and can be tested with modern RL algorithms.
Three datasets are built on the theory of Robinson Arithmetic [43], which defines addition and multiplication on the nonnegative integers. Despite its relative simplicity, this theory seems to be particularly suited for machine learning methods: solutions are long and repetitive, while also challenging for state-of-the-art systems (see Section 5). We examine both unary (24 actions) and binary (40 actions) encoding of numbers. The axioms of Robinson Arithmetic are given on the project webpage. Increasing the numbers in the conjecture greatly increases the length of the proof, making this dataset suitable for detecting the length boundary of various theorem provers.
Two datasets are extracted from the TPTP library, from the domain of Logical Calculi with condensed detachment (LCL). These theorems have been extensively studied from the early days of automated theorem proving, e.g. [28, 35, 23, 56]. We run E prover with a large time limit on the problems and augment the dataset with lemmata extracted by E. As a result, many proofs of simpler problems can be directly used as parts of the proofs of harder problems. A direct analogy from one problem to the other is usually not possible, however, shallow search is often sufficient to connect the proofs of easier problems to the proof of harder ones.
The LCL domain in the TPTP [49] library consists of statements about various formal inference systems. LCL-Eq and LCL-Imp formalize properties of the Equivalential Calculus and the Implication and Falsum Calculus, respectively. Both are subsystems of the classical propositional calculus, restricting the set of allowed connectives to and . For both subsystems, the appropriate variant of the condensed detachment inference rule ( and ) constitutes a strongly complete inference system, i.e., whenever a formula semantically follows from a set of premises, it also follows from the set syntactically. A number of complete axiomatizations of both the Equivalential Calculus and the Implication and Falsum Calculus exist and the theorems in our datasets establish connections between them.
All arithmetic problems in our dataset are quite simple for humans, but in the case of logical calculi, some of the problems were posing a challenge for mathematicians (see [55]).
5 Experiments
Our experiments with Robinson arithmetic aim to demonstrate that in this highly structured dataset FLoP is capable of extracting a general proof pattern from one or two proofs and generalizing to related proofs of arbitrary length, using a restricted few-shot evaluation method (see below). Experiments 1, 2, and 3 compare FLoP with strong theorem provers using different fragments of the arithmetic dataset, varying the complexity of the axiomatization (unary vs. binary encoding of numbers) and the complexity of the target theorems (RA-1, RA-2, RA-3). FLoP is either the best or the second-best in each experiment. In each of these experiments, FLoP is allowed proof attempts without backtracking: the first attempt is a deterministic run with a high time limit ( sec) that always selects the action maximizing the policy and the remaining runs are stochastic samples from the policy with a time limit of sec.
The LCL problems used in our experiments are less structured and success is dependent on search, even if the hierarchical composition of problems ensures that a relatively small search is sufficient to generalize from easier problems to harder ones. Consequently, we expect that search-based methods are better in this domain. However, when search is completely disallowed during evaluation, we show in Experiment 4 that FLoP performs much better than the mcts-CoPs. Finally, in Experiments 5 and 6 we demonstrate the benefit of using curriculum learning.
Experiment 1: Comparison with other provers.
We compare FLoP with a random model, two state-of-the-art saturation-style theorem provers (E 2.4, Vampire 4.3.0), a heuristic guided connection tableau prover (leanCoP 2.1), and rlCoP (one of the mcts-CoPs). Vampire, E, and leanCoP use human-designed strategies instead of learning. We use these provers in the configuration used for CASC, the yearly competition of fully automated theorem provers, employing a time limit of sec. per problem. For E, we also report the results of the auto-schedule mode. For rlCoP we used the hyperparameters described in [20], only modifying the policy temperature from to , as this works better with the Robinson datasets. The number of inferences in MCTS was limited to . rlCoP was trained on the whole evaluation set, while FLoP was trained on a single problem: and for RA-1 and RA-2, respectively.555For a description of RA-3 training problems, see Experiment 2.
Success ratios are given in Table 2. FLoP is only outperformed by E’s auto-schedule, which tries multiple strategies and finds one with the left-to-right ordering of all the addition and multiplication axioms. This solves all of our problems immediately without proof search by only rewriting to a normal form [2]. This demonstrates the power of equational theorem proving when a suitable term ordering exists and can be found by human-designed heuristics. This is, however, far from guaranteed in general even in such simple domains, as witnessed by Vampire’s failure to find this ordering. To evaluate E without access to its built-in rewriting capability, we have renamed the equality to a new predicate ‘eq’ axiomatized exactly in the same way as in leanCoP. The auto-schedule mode then becomes somewhat weaker than the auto mode, see Table 2.
Experiment 2: Harder Arithmetic Expressions.
RA-3 consists of arithmetic equalities with random expressions on both sides. This dataset is significantly more complex because there are many ways of proving the same problem. Proofs are longer, too. For FLoP, we examined various training sets and found that the system is very prone to overfitting. Most problems can be proven in many different ways, that vary greatly in terms of how well they foster generalization. It is true especially of easier problems that they can be proven with “shortcuts” that hinder generalization (See more on this in Appendix 0.E). The harder the problems, the less likely they can be solved with such heuristic approaches, hence harder training problems promise better training signal. We demonstrate this by training FLoP on a few harder problems with proofs provided, making use of curriculum learning described in Section 3. A single longer training proof is sufficient to yield meaningful generalization. Adding one more training problem helps even more, as shows Table 3.
Figure 3 shows the distribution of the length of proofs found by FLoP. We can see that a large part of the problems requires thousands of steps to solve, highlighting the need to avoid search.
For rlCoP, all RA-3 problems are too hard to solve without guidance within the inference limit, so we started with the version trained on the solutions of RA-2. Table 2 shows that FLoP is only outperformed by E’s auto-schedule mode, which again finds the rewrite ordering that solves all problems without search.
Experiment 3: Binary Number Encoding.
We experiment with Robinson Arithmetic using binary encoding of numbers. This makes the domain theory more complex: the total number of actions increases from 24 to 40. 666Note that only a subset of these is applicable in a given state. On the other hand, proofs get shorter, as shows Figure 3. Again, we train FLoP on a single proof: and for RA-1 and RA-2, respectively. Table 4 shows that provers get weaker, except for Vampire and FLoP. In particular, E is no longer capable of solving the problems with rewriting only. FLoP manages to generalize from a single proof to the whole dataset despite the increased action space and performs best in this experiment.
Experiment 4: Search vs. Eager Evaluation
We compare FLoP with plCoP (one of the mcts-CoPs) using two different evaluation methods. After training both systems on the whole dataset, we evaluate them using 1) MCTS and 2) eager evaluation, i.e. always select the action with the highest probability according to the policy model. Table 5 shows that plCoP performs better when search is allowed, especially for the more heterogeneous LCL problems. However, FLoP takes the upper hand in eager evaluation. For the LCL problems, plCoP collapses while FLoP is unaffected. This suggests that plCoP depends heavily on the search procedure it used for training. FLoP cannot make good use of MCTS, which is somewhat expected, since its policy and value networks were not trained for that purpose. For the arithmetic datasets, both systems benefit from not doing search because they reach proofs that are longer than what MCTS can reach. For FLoP, the removal of the depth limit reveals that it fully mastered the two problem classes, regardless of depth. The performance of plCoP gets even worse if the eager evaluation is based on the value model, see Appendix 0.D. These results are in line with our conjecture that the DAgger approach of plCoP is better for learning good search heuristics, while FLoP is better at internalizing a full proof pattern.
Experiment 5: Curriculum Learning vs only Exploration Based
Learning.
When training proofs are not available, the positive reward signal only occurs after the system solves a problem through exploration. Afterward, curriculum learning ensures that the system is continuously faced with a “reasonably” hard problem, alleviating the sparse reward challenge of theorem proving. We demonstrate this on the two LCL datasets. Here, before generating each rollout, we randomly select a problem from the entire dataset. We report the number of proofs found during training in Table 6. Curriculum learning brings a small, but consistent improvement when compared with only exploration-based learning.
Experiment 6: Curriculum Learning vs. Supervised Learning
When training proofs are available, a natural baseline of curriculum learning is supervised learning on the proof steps. While such behavioral cloning sometimes leads to great performance, we show in Table 7 that it greatly depends on the quality of the given proof. We train RA-1 and RA-2 using the following sets of training problems:
RA-1 , 2. 2.
RA-2 , ,
We take the “nice” proofs (, and steps) of these problems and construct variants with - extra steps added. We observe that supervised learning degrades as superfluous steps are introduced, while FLoP’s exploration allows the system to recover and find the original proofs.
6 Conclusion and Future Work
We have built FLoP, a proof guidance system based on a variant of temporal difference reinforcement learning, addressing the problem of finding long proofs in an exponential search space. Previous work [54, 24] focused on finding long proofs with the help of human-designed heuristics. We showed that FLoP is capable of extracting proof patterns via learning and can generalise to much longer proofs, implementing a simple form of reasoning by analogy. We believe that mastering analogical reasoning is an important step in creating human-level automated mathematicians. We presented a set of theorem proving datasets that are suitably challenging for existing learning methods and are intended to become a general-purpose testing ground for reinforcement learning methods. We showed that FLoP can outperform strong theorem provers on some of these datasets. We find that curriculum learning is a useful component of the learning algorithm as it allows for amplifying training signal when proofs are long.
In this paper, we focused on extracting a single proof pattern during training. A natural continuation will be to extract a portfolio of patterns from a larger pool of training problems. Transformer models are promising tools to achieve this, given their recent success in mastering several tasks in parallel. Transformers might also be capable of producing large chunks of proofs in a single inference step.
7 Acknowledgments
Adrián Csiszárik and Zsolt Zombori were supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002), the Hungarian National Excellence Grant 2018-1.2.1-NKP-00008 and by the Hungarian Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program. The work of Henryk Michalewski was supported by the Polish National Science Center grant UMO-2018/29/B/ST6/02959. Cezary Kaliszyk was supported by ERC grant no. 714034 SMART. Josef Urban was supported by the AI4REASON ERC Consolidator grant number 649043, and by the Czech project AI&Reasoning CZ.02.1.01/0.0/0.0/15_003/0000466 and the European Regional Development Fund.
This research was supported by the PL-Grid Infrastructure. In particular, quantitative results of FLoP reported in this paper were performed using the Prometheus supercomputer, located in the Academic Computer Center Cyfronet in the AGH University of Science and Technology in Kraków, Poland.
Appendix 0.A Robinson Arithmetic
Robinson Arithmetic defines basic properties of arithmetic expressions on the nonnegative integers. The signature of the language contains atom ’’ (representing [math]), functions ’’, ’plus’ and ’mul’ (for , and , respectively), and the equality predicate ’’. For example, formula is written as
[TABLE]
We use the axioms provided in Table 8. The table also contains special axioms that are added by leanCoP to handle the equality predicate. The unary representation of numbers (e.g., represents ) results in large expressions and long proofs as the numbers increase. For example, takes over steps to prove in fCoP. We show an example of such proof on the project website. The total number of actions if 24.
Appendix 0.B Robinson Arithmetic with Binary Encoding
Using binary representation makes the theory of Robinson Arithmetic more complex. Constant symbols and stand for [math] and , while term represents . For example, is represented as . We can see in Table 9 that the number of axioms increases from to . Correspondingly, the number of actions during proof search increases from 24 to 40.
Appendix 0.C Experiment hyperparameters
Our hyperparameters were selected using small grid searches. We checked standard RL parameters (e.g., the discount factor), parameters related to curriculum scheduling (e.g., local vs. global), neural network architectures (1–5 layers with 128–1024 neurons), feature sizes (64–1024) and training steps ( – ). Parameters used in the experiments are described in configuration files which are accessible along with the shared codebase.
Appendix 0.D Eager evaluation based on the value model
Given that the evaluated RL algorithms train both a policy and a value model, an alternative of policy-based eager evaluation is to select the action whose successor state has the highest value score. Table 10 shows, however, that the value-based evaluation is much worse for each dataset. We conjecture that this is because assigning a value to a never observed state is much harder than selecting from a smaller set of actions.
Appendix 0.E Failure Modes
Despite the apparent simplicity of our arithmetic learning environments, a learning system aiming to solve them has to overcome some hard challenges. We have decided to describe these challenges in detail as they are present in other domains as well, even if it may be harder to detect.
Failure type 1. The reward mechanism of our RL system is biased towards shorter proofs. However, many problems have “shortcuts” that allow for shorter proofs, but that do not generalize well. Consider formula . There are two ways to prove this equality: 1) compute the values of the expressions on both sides of the equation and notice that they are the same or 2) show that and . The former generalizes better, but the latter results in a shorter proof. Hence, training on this problem might negatively affect the performance of the prover. This is what prevents FLoP to bootstrap itself in RA-3, i.e., train on easy problems and generalize to harder ones. We find that providing some of the harder problems (having longer proofs) helps to avoid misleading shortcuts.
Failure mode 2. fCoP features do not take into account the order of the arguments of a function, i.e., and have the same features. This is problematic for RA-3, since and require different inferences. We addressed this problem by 1) extending state features with those of the preceding action as a substitute of memory, 2) modified the features to include argument order.
Failure mode 3. Some ”rare” events are hard to generalize because the system sees very few relevant samples during training. This is the case with applying commutativity of equality (replacing with ), which is only required in RA-3 and ideally only once per proof when we move the focus from one side of the equation to the other. In Experiment 4, when we trained on a single longer proof, we have noticed that the system was very unsure about this action which resulted in many failed proof attempts. Adding another training proof was enough to overcome this and the success score increased from to .
Appendix 0.F The leanCoP Connection Tableau Calculus
FLoP provides guidance for of the very compact leanCoP [33] connection tableau calculus. The calculus was originally implemented in Prolog, but it also has an OCaml reimplementation fCoP [21] and FLoP can be used to guide both systems.
We briefly describe the connection tableau calculus, assuming basic first-order logic and theorem proving terminology [42]. The input is a (mathematical) problem consisting of axioms and conjectures formally stated in first-order logic (FOL). The calculus searches for refutational proofs, i.e. proofs showing that the axioms together with the negated conjectures are unsatisfiable. The FOL formulas are first translated to clause normal form (CNF), producing a set of first-order clauses consisting of literals, e.g. . Proof search starts with a start clause as a goal and proceeds by building a connection tableau by repeatedly applying extension steps and reduction steps.
The extension step connects (unifies) the current goal with a complementary literal of a new clause. This extends the current branch, possibly splitting it into several branches if there are more literals in the new clause, and possibly instantiating some variables in the tableau. The reduction step connects the current goal to a complementary literal of the active path, thus closing the current branch. The proof is finished when all branches are closed. The extension and reduction steps are nondeterministic, requiring backtracking in the standard connection calculus. Brute force search such as iterative deepening can be used to ensure completeness. Figure 4 shows a closed connection tableau, i.e., a finished proof tree where every branch contains complementary literals (literals with opposite polarity). This shows that the set of clauses is unsatisfiable.
leanCoP represents theorem proving as a one-person game. The game ends with success if a proof is found. The prover has many choices to make along the way, in particular it can select from several valid extension and reduction steps. Whether a step is valid depends on the unification condition, i.e., if the current goal unifies with the negation of a literal in the corresponding clause. The full information about the game state consists of all previous proof steps, the partial proof tree (proof state) and the current goal.
The search space of the prover is exponentially large in the length of the proof. In leanCoP, the action space is roughly correlated with the size of the axiom set. While this can be large for large problems, typically only a few actions are available in any particular state.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. Co RR abs/1705.08439 (2017), http://arxiv.org/abs/1705.08439
- 2[2] Baader, F., Nipkow, T.: Term rewriting and all that. Cambridge University Press (1998)
- 3[3] Bansal, K., Loos, S.M., Rabe, M.N., Szegedy, C.: Learning to reason in large theories without imitation. Co RR abs/1905.10501 (2019), http://arxiv.org/abs/1905.10501
- 4[4] Bansal, K., Loos, S.M., Rabe, M.N., Szegedy, C., Wilcox, S.: HO List: An environment for machine learning of higher-order theorem proving (extended version). Co RR abs/1904.03241 (2019), http://arxiv.org/abs/1904.03241
- 5[5] Barrett, C.W., Tinelli, C.: Satisfiability modulo theories. In: Clarke, E.M., Henzinger, T.A., Veith, H., Bloem, R. (eds.) Handbook of Model Checking, pp. 305–343. Springer (2018). https://doi.org/10.1007/978-3-319-10575-8_11, https://doi.org/10.1007/978-3-319-10575-8_11 · doi ↗
- 6[6] Bertot, Y., Castran, P.: Interactive Theorem Proving and Program Development: Coq’Art The Calculus of Inductive Constructions. Springer Publishing Company, Incorporated, 1st edn. (2010)
- 7[7] Bibel, W., Eder, E., Fronhöfer, B.: Towards an advanced implementation of the connection method. In: Bundy, A. (ed.) Proceedings of the 8th International Joint Conference on Artificial Intelligence. Karlsruhe, FRG, August 1983. pp. 920–922. William Kaufmann (1983), http://ijcai.org/Proceedings/83-2/Papers/072.pdf
- 8[8] Bledsoe, W.W.: Some thoughts on proof discovery. In: Proceedings of the 1986 Symposium on Logic Programming, Salt Lake City, Utah, USA, September 22-25, 1986. pp. 2–10. IEEE-CS (1986)
