SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Anjiang Wei; Yuheng Wu; Yingjia Wan; Tarun Suresh; Huanmi Tan; Zhanke Zhou; Sanmi Koyejo; Ke Wang; Alex Aiken

arXiv:2505.14615·cs.AI·September 23, 2025

SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken

PDF

Open Access 1 Datasets 1 Video

TL;DR

SATBench is a new benchmark that evaluates large language models' logical reasoning by automatically generating and validating puzzles from SAT problems, revealing current models' limitations in search-based reasoning tasks.

Contribution

This work introduces SATBench, an automated, scalable benchmark for assessing LLMs' logical reasoning using SAT-derived puzzles, with comprehensive validation and analysis.

Findings

01

LLMs achieve only 65% accuracy on hard SAT puzzles.

02

Systematic failures include bias towards satisfiable instances and context errors.

03

Current LLMs struggle with search-based logical reasoning tasks.

Abstract

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a puzzle using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-based and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

LLM4Code/SATBench
dataset· 115 dl
115 dl

Videos

SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas· underline

Taxonomy

TopicsNatural Language Processing Techniques

MethodsSparse Evolutionary Training