IsarStep: a Benchmark for High-level Mathematical Reasoning

Wenda Li; Lei Yu; Yuhuai Wu; Lawrence C. Paulson

arXiv:2006.09265 · cs.LO · March 25, 2021

TL;DR

This paper introduces IsarStep, a benchmark dataset for high-level mathematical reasoning, and evaluates how well neural sequence-to-sequence models can fill in a missing intermediate proposition given the surrounding proof. The task is framed as a starting point toward machines that generate human-readable proofs automatically.

Contribution

The paper presents a new non-synthetic, comprehensive dataset for mathematical reasoning and proposes a hierarchical transformer model that outperforms standard baselines.

Findings

01 Neural models can learn non-trivial mathematical reasoning.

02 Hierarchical transformer outperforms baseline models.

03 The task is challenging but feasible for advanced neural architectures.

Abstract

A well-defined benchmark is essential for measuring and accelerating research progress of machine learning models. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has a broad coverage of undergraduate and research-level mathematical and computer science theorems. In our defined task, a model is required to fill in a missing intermediate proposition given surrounding proofs. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a…
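The task described in the abstract, filling in a missing intermediate proposition given the surrounding proof, can be pictured as a sequence-to-sequence pair. The sketch below is illustrative only: the `MASK` token, the `" ; "` separator, the function name `make_examples`, and the toy proof steps are assumptions made for this example, not the dataset's actual format or the authors' preprocessing.

```python
from typing import List, Tuple

# Hypothetical placeholder for the proposition the model must predict.
MASK = "<MISSING>"


def make_examples(steps: List[str]) -> List[Tuple[str, str]]:
    """Build (source, target) pairs from an ordered list of proof steps.

    Each intermediate step (neither the first nor the last) is masked in
    turn: the source is the proof with that step replaced by MASK, and
    the target is the step that was hidden.
    """
    examples = []
    for i in range(1, len(steps) - 1):
        context = steps[:i] + [MASK] + steps[i + 1:]
        examples.append((" ; ".join(context), steps[i]))
    return examples


# Toy proof with a single intermediate step (invented for illustration).
toy_proof = ["a = 2 * k", "a * a = 4 * k * k", "even (a * a)"]
pairs = make_examples(toy_proof)
print(pairs[0][0])  # a = 2 * k ; <MISSING> ; even (a * a)
print(pairs[0][1])  # a * a = 4 * k * k
```

A sequence-to-sequence model would then be trained to map each source string to its target. How the real benchmark encodes the surrounding context may well differ from this flat, separator-joined string.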

Code & Models

Videos

IsarStep: a Benchmark for High-level Mathematical Reasoning · SlidesLive

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding