MLFMF: Data Sets for Machine Learning for Mathematical Formalization

Andrej Bauer; Matej Petkovi\'c; Ljup\v{c}o Todorovski

arXiv:2310.16005·cs.LG·October 25, 2023·2 cites

MLFMF: Data Sets for Machine Learning for Mathematical Formalization

Andrej Bauer, Matej Petkovi\'c, Ljup\v{c}o Todorovski

PDF

Open Access 1 Repo 1 Video

TL;DR

MLFMF introduces extensive datasets derived from formalized mathematics libraries in Agda and Lean, enabling benchmarking of recommendation systems to support mathematical formalization efforts.

Contribution

This work provides the largest collection of formalized mathematical data sets for machine learning, including network and syntax representations from major proof assistant libraries.

Findings

01

Baseline results with standard embeddings and algorithms demonstrate the datasets' utility.

02

The methodology is adaptable to other libraries and proof assistants.

03

The collection supports future research in machine learning for formalized mathematics.

Abstract

We introduce MLFMF, a collection of data sets for benchmarking recommendation systems used to support formalization of mathematics with proof assistants. These systems help humans identify which previous entries (theorems, constructions, datatypes, and postulates) are relevant in proving a new theorem or carrying out a new construction. Each data set is derived from a library of formalized mathematics written in proof assistants Agda or Lean. The collection includes the largest Lean~4 library Mathlib, and some of the largest Agda libraries: the standard library, the library of univalent mathematics Agda-unimath, and the TypeTopology library. Each data set represents the corresponding library in two ways: as a heterogeneous network, and as a list of s-expressions representing the syntax trees of all the entries in the library. The network contains the (modular) structure of the library…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ul-fmf/mlfmf-data
noneOfficial

Videos

MLFMF: Data Sets for Machine Learning for Mathematical Formalization· slideslive

Taxonomy

TopicsMathematics, Computing, and Information Processing · Data Mining Algorithms and Applications · Logic, programming, and type systems

MethodsSparse Evolutionary Training · Lib