Evaluating Robustness of Reasoning Models on Parameterized Logical Problems
Naïm Es-sebbani, Esteban Marquer, Yakoub Salhi, Zied Bouraoui

TL;DR
This paper introduces a diagnostic benchmark for 2-SAT to evaluate the reasoning robustness of language models, isolating structural factors that influence satisfiability and revealing brittleness under targeted perturbations.
Contribution
It presents a structured, parameterized 2-SAT benchmark that isolates reasoning competencies and failure modes, enabling detailed analysis of model robustness beyond surface-level difficulty.
Findings
Models exhibit sharp performance drops under structural perturbations.
Surface statistics alone do not predict model robustness.
Structural interventions reveal brittleness regimes in reasoning models.
Abstract
Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that…
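To make the implication-graph characterization concrete, here is a minimal Python sketch. It is not the authors' generator or solver. It relies on the standard fact that a 2-CNF formula is unsatisfiable exactly when some variable and its negation share a strongly connected component of the implication graph. The function names (`is_satisfiable`, `contradiction_cycle`), the signed-integer literal encoding, and the parameters `forward_len` and `backward_len` are all assumptions made for illustration. The second function is only one plausible reading of a contradiction cycle with controllable size and imbalance: an implication path from x1 to NOT x1 and a return path from NOT x1 to x1, each of chosen length.

```python
from typing import List, Tuple

# A clause is a pair of literals; +i means x_i, -i means NOT x_i (illustrative encoding).
Clause = Tuple[int, int]


def _node(lit: int) -> int:
    """Map a literal to a graph node: x_i -> 2(i-1), NOT x_i -> 2(i-1)+1."""
    return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)


def is_satisfiable(n_vars: int, clauses: List[Clause]) -> bool:
    """Decide 2-SAT with Kosaraju's SCC algorithm on the implication graph."""
    n = 2 * n_vars
    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a OR b) contributes the implications NOT a -> b and NOT b -> a.
        for u, v in ((_node(-a), _node(b)), (_node(-b), _node(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    # Pass 1: iterative DFS on the graph, recording nodes in finish order.
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                v = graph[u][i]
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Pass 2: label components on the reversed graph, in reverse finish order.
    comp, label = [-1] * n, 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = label
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rgraph[u]:
                if comp[v] == -1:
                    comp[v] = label
                    stack.append(v)
        label += 1

    # UNSAT exactly when some x_i shares a component with NOT x_i.
    return all(comp[2 * i] != comp[2 * i + 1] for i in range(n_vars))


def contradiction_cycle(forward_len: int, backward_len: int) -> Tuple[int, List[Clause]]:
    """Hypothetical UNSAT core: x1 => NOT x1 in forward_len steps, back in backward_len."""
    if forward_len < 1 or backward_len < 1:
        raise ValueError("both path lengths must be at least 1")
    clauses: List[Clause] = []
    next_var = 2  # variable 1 is the pivot; fresh variables fill the paths

    def add_path(start: int, end: int, length: int) -> None:
        nonlocal next_var
        cur = start
        for _ in range(length - 1):
            clauses.append((-cur, next_var))  # encodes cur -> next_var
            cur = next_var
            next_var += 1
        clauses.append((-cur, end))  # encodes cur -> end

    add_path(1, -1, forward_len)
    add_path(-1, 1, backward_len)
    return next_var - 1, clauses
```

In this sketch, `contradiction_cycle(3, 2)` yields a five-clause formula over four variables that `is_satisfiable` rejects. Dropping the final clause breaks the return path, so the remaining formula is satisfiable with x1 set to false. Choosing very different values for the two lengths gives an imbalanced cycle while keeping the total clause count fixed at `forward_len + backward_len`.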
Taxonomy
Topics: Formal Methods in Verification · Constraint Satisfaction and Optimization · Logic, Reasoning, and Knowledge
