CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

Junqi Liu; Xiaohan Lin; Jonas Bayer; Yael Dillies; Weijie Jiang,; Xiaodan Liang; Roman Soletskyi; Haiming Wang; Yunzhou Xie; Beibei Xiong,; Zhengfeng Yang; Jujian Zhang; Lihong Zhi; Jia Li; Zhengying Liu

arXiv:2505.03171·cs.AI·May 7, 2025

CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics

Junqi Liu, Xiaohan Lin, Jonas Bayer, Yael Dillies, Weijie Jiang,, Xiaodan Liang, Roman Soletskyi, Haiming Wang, Yunzhou Xie, Beibei Xiong,, Zhengfeng Yang, Jujian Zhang, Lihong Zhi, Jia Li, Zhengying Liu

PDF

Open Access 1 Repo 1 Datasets

TL;DR

CombiBench is a new benchmark for evaluating large language models on combinatorial mathematics problems, featuring formalized problems in Lean4 and a standardized evaluation framework, revealing current models' limited capabilities in this domain.

Contribution

The paper introduces CombiBench, a comprehensive combinatorial problem benchmark with formalization in Lean4 and a novel evaluation framework, filling a gap in formal mathematics benchmarking.

Findings

01

Kimina-Prover solved 7 out of 100 problems.

02

LLMs show limited capability in formal combinatorial problem solving.

03

Benchmark and evaluation tools are open-sourced for future research.

Abstract

Neurosymbolic approaches integrating large language models with formal reasoning have recently achieved human-level performance on mathematics competition problems in algebra, geometry and number theory. In comparison, combinatorics remains a challenging domain, characterized by a lack of appropriate benchmarks and theorem libraries. To address this gap, we introduce CombiBench, a comprehensive benchmark comprising 100 combinatorial problems, each formalized in Lean~4 and paired with its corresponding informal statement. The problem set covers a wide spectrum of difficulty levels, ranging from middle school to IMO and university level, and span over ten combinatorial topics. CombiBench is suitable for testing IMO solving capabilities since it includes all IMO combinatorial problems since 2000 (except IMO 2004 P3 as its statement contain an images). Furthermore, we provide a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

moonshotai/combibench
noneOfficial

Datasets

iiis-lean/lean-math-formal-corpus
dataset· 314 dl
314 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing

MethodsSparse Evolutionary Training