LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)

Rongge Xu; Hui Dai; Yiming Fu; Jiedong Jiang; Tianjiao Nie; Junkai Wang; Holiverse Yang; Zhi-Hao Zhang

arXiv:2512.24796·cs.LO·February 27, 2026

LeanCat: A Benchmark Suite for Formal Category Theory in Lean (Part I: 1-Categories)

Rongge Xu, Hui Dai, Yiming Fu, Jiedong Jiang, Tianjiao Nie, Junkai Wang, Holiverse Yang, Zhi-Hao Zhang

PDF

Open Access 1 Datasets

TL;DR

This paper introduces LeanCat, a challenging benchmark suite for formal category theory in Lean, revealing significant gaps in current models' abstraction capabilities and demonstrating the effectiveness of retrieval-augmented reasoning methods.

Contribution

The paper presents LeanCat, a new benchmark for formal category theory tasks, and evaluates retrieval-augmented models, showing improved performance and highlighting the importance of iterative reasoning.

Findings

01

Current models solve only 12% of tasks at pass@4.

02

Performance drops from 55% on Easy to 0% on High-difficulty tasks.

03

Retrieval-augmented LeanBridge doubles success rate to 24%.

Abstract

While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with high-level interfaces and reusable structures central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test for structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure in compositional generalization. To overcome this, we evaluate LeanBridge, a retrieval-augmented agent that employs a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

iiis-lean/lean-math-formal-corpus
dataset· 314 dl
314 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning in Materials Science · Mathematics, Computing, and Information Processing · Model-Driven Software Engineering Techniques