MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Zixuan Xie; Xinyu Liu; Shangtong Zhang

arXiv:2605.07147·cs.LO·May 14, 2026

MathlibPR: Pull Request Merge-Readiness Benchmark for Formal Mathematical Libraries

Zixuan Xie, Xinyu Liu, Shangtong Zhang

PDF

TL;DR

MathlibPR is a benchmark designed to evaluate the ability of large language models to assess the merge-readiness of pull requests in the Mathlib formal mathematical library, addressing a bottleneck in the review process.

Contribution

The paper introduces MathlibPR, a new benchmark for LLMs to evaluate and assist in reviewing Mathlib PRs, highlighting current challenges and potential for automation.

Findings

01

LLMs struggle to distinguish merge-ready PRs from others

02

MathlibPR provides a supervised signal for training reviewer assistants

03

Evaluation of several LLMs shows limited success in PR assessment

Abstract

The ecosystem of Lean and Mathlib has become the de facto standard for large language model (LLM) assisted formal reasoning with remarkable successes in recent years. Those successes, however, only consume Mathlib as an essential dependency but do not directly contribute to it. In the meantime, the growth of Mathlib has recently been bottlenecked by the review process, which requires human reviewers to judge whether proposed pull requests (PRs) follow the Mathlib's conventions and are worth integrating as part of a shared mathematical infrastructure. This leads to our central question: can LLMs help review Mathlib PRs? To this end, we introduce MathlibPR, a benchmark built from real Mathlib4 PR histories. We further propose a staged evaluation protocol and use it to evaluate both LLM models (e.g., DeepSeek, Qwen, Goedel, and Kimina) and LLM agents (e.g., Codex and Claude Code).…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.