TaoBench: Do Automated Theorem Prover LLMs Generalize Beyond MathLib?

Alexander K Taylor; Junyi Zhang; Ethan Ji; Vigyan Sahai; Haikang Deng; Yuanzhou Chen; Yifan Yuan; Di Wu; Jia-Chen Gu; Kai-Wei Chang; Nanyun Peng; Amit Sahai; and Wei Wang

arXiv:2603.12744 · cs.LG · March 16, 2026


TL;DR

This paper introduces TaoBench, a new benchmark based on Terence Tao's Analysis I, to evaluate the robustness of automated theorem proving systems across different definitional frameworks, revealing significant generalization gaps.

Contribution

The work presents TaoBench, a benchmark that tests ATP models on problems whose underlying definitions are constructed from scratch, highlighting the models' limited ability to generalize beyond standard Mathlib definitions.

Findings

01 State-of-the-art ATP models perform 26% worse on TaoBench than on Mathlib-based problems.

02 The performance drop is mainly due to differences in definitional framework, not task difficulty.

03 TaoBench provides a foundation for developing provers aligned with research mathematics.
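To make the idea of a definitional framework gap concrete, the following is a minimal Lean 4 sketch. It is a hypothetical illustration, not taken from TaoBench or from the paper: the names `Sketch`, `MyNat`, `MyNat.add`, and `MyNat.add_zero` are assumptions introduced here. It contrasts a natural-number type built from scratch, where even `n + 0 = n` needs its own induction proof, with the built-in `Nat`, where a library lemma closes the goal directly.

```lean
-- Hypothetical illustration only; none of these names come from TaoBench.
namespace Sketch

/-- A Peano-style natural number type defined from scratch,
    rather than reusing the built-in `Nat`. -/
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

/-- Addition by structural recursion on the first argument. -/
def MyNat.add : MyNat → MyNat → MyNat
  | .zero,   m => m
  | .succ n, m => .succ (MyNat.add n m)

/-- `n + 0 = n` for the from-scratch type. No existing lemma about `Nat`
    applies to `MyNat`, so the proof goes by induction. -/
theorem MyNat.add_zero (n : MyNat) : MyNat.add n .zero = n := by
  induction n with
  | zero => rfl
  | succ k ih =>
    -- Unfold one step of `add`, then rewrite with the induction hypothesis.
    show MyNat.succ (MyNat.add k .zero) = MyNat.succ k
    rw [ih]

/-- The same statement over the built-in `Nat` is a one-line lemma lookup. -/
example (n : Nat) : n + 0 = n := Nat.add_zero n

end Sketch
```

A prover that has mostly seen the second style may know which library lemma to cite, yet still have to rediscover the induction argument when the statement is phrased over a bespoke type such as `MyNat`.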

Abstract

Automated theorem proving (ATP) benchmarks largely consist of problems formalized in MathLib, so current ATP training and evaluation are heavily biased toward MathLib's definitional framework. However, frontier mathematics is often exploratory and prototype-heavy, relying on bespoke constructions that deviate from standard libraries. In this work, we evaluate the robustness of current ATP systems when applied to a novel definitional framework, specifically examining the performance gap between standard library problems and bespoke mathematical constructions. We introduce TaoBench, an undergraduate-level benchmark derived from Terence Tao's Analysis I, which formalizes analysis by constructing core mathematical concepts from scratch, without relying on standard Mathlib definitions, as well as by mixing from-scratch and MathLib constructions. For fair evaluation, we build an agentic…


Code & Models

Datasets

uclanlp/TaoBench
dataset · 84 downloads


Taxonomy

Topics: Mathematics, Computing, and Information Processing · Logic, programming, and type systems · Polynomial and algebraic computation