MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
Viresh Pati, Zhengyu Li, Piyush Jha, Rahul Garg, Yatharth Sejpal, Vijay Ganesh

TL;DR
MathConstraint is a new adaptive benchmark for testing the combinatorial reasoning of large language models, combining constraint satisfaction problems with solver-based verification to generate challenging, verifiable instances.
Contribution
It introduces a scalable, parameterized benchmark generator that creates difficult, automatically verifiable combinatorial reasoning problems for LLM evaluation.
Findings
Frontier models achieve 72.6% to 87.6% accuracy on easy instances.
Model accuracy drops to 18.5% to 66.9% on the main benchmark.
Tool access improves model performance substantially, by up to 52 percentage points.
Abstract
We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or use LLM-as-a-judge for checking solutions, MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult and automatically verifiable instances. We release MathConstraint-Easy ( instances), on which frontier models achieve between 72.6% (gemini-3.1-flash-lite) and 87.6% (gpt-5.5) accuracy, and MathConstraint ( instances), on which the same models drop to between 18.5% (claude-4.6-sonnet) and 66.9% (gpt-5.5) accuracy,…
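To make the idea of a parameterized generator with automatic verification concrete, here is a minimal sketch. It is not the authors' code: graph coloring is chosen only as an illustrative constraint satisfaction problem, the function names (generate_coloring_instance, verify_coloring, brute_force_solve) are hypothetical, and the brute-force search stands in for a real constraint solver.

```python
import random
from itertools import product

# Illustrative sketch only; graph coloring and all names here are assumptions,
# not the problem types or API of MathConstraint.

def generate_coloring_instance(n_vertices, edge_prob, n_colors, seed):
    """Build a random graph-coloring instance; difficulty scales with the parameters."""
    rng = random.Random(seed)
    edges = [
        (u, v)
        for u in range(n_vertices)
        for v in range(u + 1, n_vertices)
        if rng.random() < edge_prob
    ]
    return {"n": n_vertices, "k": n_colors, "edges": edges}

def verify_coloring(instance, coloring):
    """Check a candidate answer mechanically, with no LLM judge involved."""
    if len(coloring) != instance["n"]:
        return False
    if any(not (0 <= c < instance["k"]) for c in coloring):
        return False
    return all(coloring[u] != coloring[v] for u, v in instance["edges"])

def brute_force_solve(instance):
    """Stand-in for a real solver: exhaustive search, feasible only for tiny instances."""
    for coloring in product(range(instance["k"]), repeat=instance["n"]):
        if verify_coloring(instance, coloring):
            return list(coloring)
    return None  # no valid coloring exists

if __name__ == "__main__":
    instance = generate_coloring_instance(n_vertices=6, edge_prob=0.5, n_colors=3, seed=0)
    solution = brute_force_solve(instance)
    print("satisfiable:", solution is not None)
```

Raising n_vertices or edge_prob, or lowering n_colors, yields harder instances from the same template, while the checker stays unchanged. That separation between generation and verification is the property the abstract emphasizes.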
