MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification
Sirui Li, Wangyue Lu, Xiaorui Shi, Ke Weng, Haozhe Sun, Minghe Yu, Tiancheng Zhang, Ge Yu, Hengyu Liu, Lun Du

TL;DR
MSC-180 is a comprehensive benchmark with 180 formal problems across 60 mathematical domains, designed to evaluate and improve the reasoning capabilities of AI theorem provers beyond pattern matching.
Contribution
This paper introduces MSC-180, a new benchmark with expert-verified problems across diverse domains, and proposes the coefficient of variation as a metric to assess model generalization and domain bias.
Findings
State-of-the-art LLM provers achieve only 18.89% pass rate.
Models show significant domain bias with max coverage of 41.7%.
High variability in performance indicates reliance on pattern matching.
Abstract
Automated Theorem Proving (ATP) represents a core research direction in artificial intelligence for achieving formal reasoning and verification, playing a significant role in advancing machine intelligence. However, current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. To address these issues, we propose MSC-180, a benchmark for evaluation based on the MSC2020 mathematical subject classification. It comprises 180 formal verification problems, 3 advanced problems from each of 60 mathematical branches, spanning from undergraduate to graduate levels. Each problem has undergone multiple rounds of verification and refinement by domain experts to ensure formal accuracy. Evaluations of state-of-the-art LLM-based theorem provers under the pass@32 setting reveal that the best model…
Peer Reviews
Decision·Submitted to ICLR 2026
- Valuable and well-curated dataset. - The introduction of the coefficient of variation as a cross-domain stability metric is a thoughtful and useful addition.
- The writing needs quite some polishing (e.g., missing a stop in line 319, 323, 336, 426, 431, 435, 440, 444, 449). - Lack of in-depth comparison with previous dataset like ProofNet, FormalMATH, and FATE [1]. - The data processing process is not well illustrated. For example, how is the semantic verification being carried using the DeepSeek API? What are the exact criteria for human verification and selection? [1] Shen, Ziju, et al. "REAL-Prover: Retrieval Augmented Lean Prover for Mathematic
1. The dataset is curated and verified carefully by domain experts, ensuring high quality and formal soundness across 60 domains. 2. There is clear problem taxonomy, reproducible setup, and well-defined evaluation metrics (pass@k, Domain@k, CV) make the study credible. 3. Balanced interpretation: The discussion doesn't overclaim; it clearly states both the models’ current limitations and partial strengths (e.g., structured reasoning in applied math). 4. The paper is logically structured, prec
1. The work stops at benchmarking, without concrete methods proposed to improve reasoning, this is a significant limitation. 2. Limited diversity of evaluated models. While several provers are tested, it would be useful to include symbolic or neuro-symbolic baselines to contextualize LLM gaps. 3. It remains unclear how representative these 180 problems are of real downstream theorem-proving workloads in Lean, Coq, or Isabelle. 4. Some details can be further justified, such as: is there any st
1. The MSC-180 dataset proposed in the paper claims to cover 60 major categories of mathematics, which encompasses a broader range of topics compared to commonly used ATP datasets such as miniF2F. This broader coverage could have potential value for evaluating ATP performance across a wider variety of mathematical topics. 2. The paper introduces the use of the coefficient of variation as a metric to assess ATP performance across different mathematical topics. This provides a meaningful referenc
1. The dataset is far too small, which significantly undermines its practical value. With only three problems per mathematical category, evaluations of provers’ proving ability and generalization become highly susceptible to randomness; both the variance and mean of scores will be unstable and may lack statistical significance. Having only three problems also fails to adequately cover the breadth and depth of domain knowledge for each category, so inferring a prover’s overall capability in a fie
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Machine Learning in Materials Science · Topic Modeling
