MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

Viresh Pati; Zhengyu Li; Piyush Jha; Rahul Garg; Yatharth Sejpal; Vijay Ganesh

arXiv:2605.08498·cs.LG·May 12, 2026

TL;DR

MathConstraint is a new adaptive benchmark for testing the combinatorial reasoning of large language models, combining constraint satisfaction problems with solver-based verification to generate challenging, verifiable instances.

Contribution

The paper introduces a scalable, parameterized benchmark generator that creates difficult, automatically verifiable combinatorial reasoning problems for LLM evaluation.
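To make the idea of parameterized generation with automatic verification concrete, here is a minimal sketch in Python. It is not the MathConstraint generator: the problem type (random graph coloring), the function names `generate_instance`, `verify`, and `brute_force_solve`, and the parameters `n_vertices`, `edge_prob`, and `n_colors` are all assumptions chosen for illustration, and the brute-force search merely stands in for a real constraint solver.

```python
import itertools
import random


def generate_instance(n_vertices, edge_prob, n_colors, seed):
    """Hypothetical parameterized generator: a random graph-coloring CSP.

    Raising n_vertices or edge_prob (or lowering n_colors) tends to make
    instances harder, which is the sense of "parameterized" assumed here.
    """
    rng = random.Random(seed)  # seeded, so instances are reproducible
    edges = [
        (u, v)
        for u, v in itertools.combinations(range(n_vertices), 2)
        if rng.random() < edge_prob
    ]
    return {"n_vertices": n_vertices, "n_colors": n_colors, "edges": edges}


def verify(instance, coloring):
    """Check a candidate answer against every constraint (no LLM judge)."""
    if len(coloring) != instance["n_vertices"]:
        return False
    if any(not (0 <= c < instance["n_colors"]) for c in coloring):
        return False
    # Adjacent vertices must receive different colors.
    return all(coloring[u] != coloring[v] for u, v in instance["edges"])


def brute_force_solve(instance):
    """Stand-in for a real solver: exhaustive search, tiny instances only."""
    colors = range(instance["n_colors"])
    for coloring in itertools.product(colors, repeat=instance["n_vertices"]):
        if verify(instance, coloring):
            return list(coloring)
    return None  # the instance is unsatisfiable
```

In this sketch, a model's answer would be parsed into a list of color indices and passed to `verify`, so correctness is decided by the constraints themselves rather than by another model. A solver call such as `brute_force_solve(generate_instance(6, 0.5, 3, seed=0))` can be used at generation time to confirm whether an instance is satisfiable before it is released.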

Findings

01

On MathConstraint-Easy (266 instances), frontier models achieve between 72.6% and 87.6% accuracy.

02

On the main MathConstraint benchmark (329 instances), the same models drop to between 18.5% and 66.9% accuracy.

03

Tool access significantly improves model performance, with gains of up to 52 percentage points.

Abstract

We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or use LLM-as-a-judge for checking solutions, MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult and automatically verifiable instances. We release MathConstraint-Easy (266 instances), on which frontier models achieve between 72.6% (gemini-3.1-flash-lite) and 87.6% (gpt-5.5) accuracy, and MathConstraint (329 instances), on which the same models drop to between 18.5% (claude-4.6-sonnet) and 66.9% (gpt-5.5) accuracy,…
