LeanDojo: a toolkit for developing and evaluating ReProver and other LLM-based theorem provers

July 7, 2023

LeanDojo is an open-source Lean playground that provides resources, data, models, and benchmarks for theorem proving. It enables the use of large language models (LLMs) for proving math statements in Lean, a software for writing and checking math proofs. It also helps to develop ReProver, the first LLM-based prover that uses retrieval from Lean’s math library to generate tactics (proof steps).

The new model was proposed by a research team from Caltech, Nvidia, MIT, UC Santa Barbara, and UT Austin.

ReProver (Retrieval-Augmented Prover) is the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library. It was developed using data and toolkits from LeanDojo. The prover does not rely on any proprietary datasets and has a permissive MIT license.

Theorem proving is a challenging task

Theorem proving is the process of using logic and rules to show that mathematical or logical statements are true. It has many applications in fields like computer science, mathematics, and engineering.

Theorem proving is a challenging task. There are many possible ways to prove a statement, and some of them may be very difficult or intricate. Also, you need a lot of knowledge and skills to create a proof.

Therefore, interactive theorem proving (ITP) has emerged as an alternative paradigm. In ITP, proofs are constructed with the help of proof assistants, which are software tools that can check and automate proofs (such as Coq, Isabelle and Lean). However, theorem proving is still a hard problem.

The pros and cons of using LLMs for theorem proving

ML methods, especially LLMs, can help to make theorem proving more efficient and effective. But existing LLM-based methods are not easy to reproduce or build on, because they have private code, data, and large compute requirements.

They also do not use the rich information available in the proof assistant’s math library, which contains a large collection of premises (such as lemmas and definitions) that are essential for proving theorems.

LeanDojo makes theorem proving easier and more accessible

The open-source LeanDojo platform addresses these challenges and makes theorem proving easier and more accessible.

It extracts data (proof states, tactics, premises, etc.) from Lean repos and lets you interact with it using code.

The team created Retrieval-Augmented Prover (ReProver), the first prover that uses LLMs and retrieval to choose premises from a large math library. Premises are existing facts, rules, lemmas, and definitions useful for proving a theorem.

Using LeanDojo, the team constructed a new benchmark with 96,962 theorems and proofs from Lean’s math library. It is one of the most extensive datasets for theorem proving in mathematics.

The figure below illustrates how LeanDojo and ReProver work together to facilitate theorem proving with retrieval-augmented language models.

The top right shows the LeanDojo toolkit. It extracts data (theorems, proofs, tactics, premises) from Lean and interacts with Lean’s proof environment. The extracted data is used to create a benchmark for training and evaluating the ML models. After training, the ML models can prove theorems by interacting with Lean.

The top left shows the proof tree of the theorem: ∀ n ∈ N, gcd n n = n, which states that the greatest common divisor (gcd) of any natural number n and itself is n. The proof tree begins with the theorem as the root state and applies tactics (the edges) to break down states into simpler sub-states, until all states are solved (the leaf nodes). Some tactics use premises such as mod_self and gcd_zero_left from the math library.

mod_self is a theorem that says that if you divide any natural number by itself, the remainder is zero. For example, 7 divided by 7 has a remainder of zero. This is true for any natural number, except zero, because you cannot divide by zero.
gcd_zero_left is a theorem that says that the greatest common divisor (gcd) of zero and n natural number, the result is n. For example, the gcd 0 5 = 5. This is true for any natural number, except zero, because the gcd 0 0 = undefined.

The bottom shows how the ReProver model works. Given a state, it retrieves premises from the math library, which are concatenated with the state and fed into an encoder-decoder Transformer to generate the next tactic.

A state is a description of what you want to prove and what you have proved so far. In our case we want to prove that ∀ n ∈ N, gcd((k+1)%(k+1)) (k+1) = k+1. The ReProver outputs the best tactic (command or step) for this state. It does this by following three main steps:

It searches the relevant premises in the math library (e.g. mod_self, gcd_zero_left).
It combines the state & the premises into one piece of information feeds it into an encoder-decoder Transformer.
It outputs the best tactic for the state (in our case it outputs the “rewrite mod_self” tactic.

LeanDojo turns Lean into a flexible environment (see picture below). The user can see the proof state, run tactics to modify it, and receive feedback on errors or success.

The ReProver model uses retrieval to find relevant premises from the Lean’s math library and then generates tactics based on them.

Training and evaluation

ReProver was trained and evaluated on the LeanDojo Benchmark consisting of theorems, premises, and tactics extracted from mathlib. The Benchmark was divided into three subsets: train, valid, and test. The training took 5 days on a single GPU.

The table below shows the evaluation results. ReProver could prove 51.4% theorems, outperforming a baseline that generated tactics directly without retrieval (47.5%) and another baseline using GPT-4 to generate tactics in a zero-shot manner (28.8%).

Method	random	novel_premises
tidy	23.8	5.4
GPT-4	28.8	7.5
ReProver (ours)	51.4	26.2
w/o retrieval	47.5	22.9

Theorem proving on the testing data of LeanDojo Benchmark

ReProver performed well on two existing datasets, MiniF2F and ProofNet, compared to other methods without reinforcement learning. It also proved 65 theorems that didn’t have proofs in Lean, showing its usefulness for expanding math libraries.

Conclusion

LeanDojo is an open-source Lean playground for theorem proving that provides toolkits, data, models, and benchmarks for learning-based theorem provers in Lean.

ReProver, which was developed using LeanDojo’s data and toolkits, is the first retrieval-augmented Language Model (LLM) designed specifically for theorem proving.

LeanDojo is an important advancement for math proof, as it removes the barriers of private code, data, and extensive computational requirements.