MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification

Sirui Li; Wangyue Lu; Xiaorui Shi; Ke Weng; Haozhe Sun; Minghe Yu; Tiancheng Zhang; Ge Yu; Hengyu Liu; Lun Du

arXiv:2512.18256·cs.AI·December 23, 2025

MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification

Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, Lun Du

PDF

Open Access 3 Reviews

TL;DR

MSC-180 is a comprehensive benchmark with 180 formal problems across 60 mathematical domains, designed to evaluate and improve the reasoning capabilities of AI theorem provers beyond pattern matching.

Contribution

This paper introduces MSC-180, a new benchmark with expert-verified problems across diverse domains, and proposes the coefficient of variation as a metric to assess model generalization and domain bias.

Findings

01

State-of-the-art LLM provers achieve only 18.89% pass rate.

02

Models show significant domain bias with max coverage of 41.7%.

03

High variability in performance indicates reliance on pattern matching.

Abstract

Automated Theorem Proving (ATP) represents a core research direction in artificial intelligence for achieving formal reasoning and verification, playing a significant role in advancing machine intelligence. However, current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. To address these issues, we propose MSC-180, a benchmark for evaluation based on the MSC2020 mathematical subject classification. It comprises 180 formal verification problems, 3 advanced problems from each of 60 mathematical branches, spanning from undergraduate to graduate levels. Each problem has undergone multiple rounds of verification and refinement by domain experts to ensure formal accuracy. Evaluations of state-of-the-art LLM-based theorem provers under the pass@32 setting reveal that the best model…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 4

Strengths

- Valuable and well-curated dataset. - The introduction of the coefficient of variation as a cross-domain stability metric is a thoughtful and useful addition.

Weaknesses

- The writing needs quite some polishing (e.g., missing a stop in line 319, 323, 336, 426, 431, 435, 440, 444, 449). - Lack of in-depth comparison with previous dataset like ProofNet, FormalMATH, and FATE [1]. - The data processing process is not well illustrated. For example, how is the semantic verification being carried using the DeepSeek API? What are the exact criteria for human verification and selection? [1] Shen, Ziju, et al. "REAL-Prover: Retrieval Augmented Lean Prover for Mathematic

Reviewer 02Rating 6Confidence 3

Strengths

1. The dataset is curated and verified carefully by domain experts, ensuring high quality and formal soundness across 60 domains. 2. There is clear problem taxonomy, reproducible setup, and well-defined evaluation metrics (pass@k, Domain@k, CV) make the study credible. 3. Balanced interpretation: The discussion doesn't overclaim; it clearly states both the models’ current limitations and partial strengths (e.g., structured reasoning in applied math). 4. The paper is logically structured, prec

Weaknesses

1. The work stops at benchmarking, without concrete methods proposed to improve reasoning, this is a significant limitation. 2. Limited diversity of evaluated models. While several provers are tested, it would be useful to include symbolic or neuro-symbolic baselines to contextualize LLM gaps. 3. It remains unclear how representative these 180 problems are of real downstream theorem-proving workloads in Lean, Coq, or Isabelle. 4. Some details can be further justified, such as: is there any st

Reviewer 03Rating 2Confidence 4

Strengths

1. The MSC-180 dataset proposed in the paper claims to cover 60 major categories of mathematics, which encompasses a broader range of topics compared to commonly used ATP datasets such as miniF2F. This broader coverage could have potential value for evaluating ATP performance across a wider variety of mathematical topics. 2. The paper introduces the use of the coefficient of variation as a metric to assess ATP performance across different mathematical topics. This provides a meaningful referenc

Weaknesses

1. The dataset is far too small, which significantly undermines its practical value. With only three problems per mathematical category, evaluations of provers’ proving ability and generalization become highly susceptible to randomness; both the variance and mean of scores will be unstable and may lack statistical significance. Having only three problems also fails to adequately cover the breadth and depth of domain knowledge for each category, so inferring a prover’s overall capability in a fie

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling