Evaluating Robustness of Reasoning Models on Parameterized Logical Problems

Naïm Es-sebbani; Esteban Marquer; Yakoub Salhi; Zied Bouraoui

arXiv:2602.12665·cs.AI·February 16, 2026

TL;DR

This paper introduces a diagnostic benchmark for 2-SAT to evaluate the reasoning robustness of language models, isolating structural factors that influence satisfiability and revealing brittleness under targeted perturbations.

Contribution

The paper presents a structured, parameterized 2-SAT benchmark that isolates individual reasoning competencies and failure modes, enabling analysis of model robustness beyond surface-level difficulty.

Findings

1. Models exhibit sharp performance drops under structural perturbations.

2. Surface statistics alone do not predict model robustness.

3. Structural interventions reveal brittleness regimes in reasoning models.

Abstract

Logic provides a controlled testbed for evaluating LLM-based reasoners, yet standard SAT-style benchmarks often conflate surface difficulty (length, wording, clause order) with the structural phenomena that actually determine satisfiability. We introduce a diagnostic benchmark for 2-SAT built from parameterized families of structured 2-CNF formulas, where satisfiability is characterized by the implication graph and can be tuned along interpretable axes. Our generators isolate distinct competencies and failure modes: (i) contradiction-cycle UNSAT cores with controllable size and imbalance, (ii) SAT instances with a prescribed fraction of free variables to control solution multiplicity, (iii) planted backbones that modulate propagation, (iv) late bridge clauses that couple otherwise monotone regions to probe sensitivity to ordering and revision, and (v) symmetry/duplication variants that…
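To make the implication-graph characterization concrete, here is a minimal Python sketch of the standard textbook 2-SAT check: each clause (a ∨ b) contributes the edges ¬a → b and ¬b → a, and the formula is unsatisfiable exactly when some variable shares a strongly connected component with its negation. This is not the authors' code. The function names `is_satisfiable` and `contradiction_cycle`, the signed-integer literal encoding, and the way the two chain lengths stand in for "size and imbalance" are all assumptions made for illustration.

```python
def _scc_ids(nodes, graph):
    """Kosaraju's algorithm (iterative). Returns a dict: node -> SCC representative."""
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: record finishing order.
                order.append(node)
                stack.pop()

    reverse = {n: [] for n in nodes}
    for u in nodes:
        for v in graph[u]:
            reverse[v].append(u)

    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            u = stack.pop()
            for v in reverse[u]:
                if v not in comp:
                    comp[v] = root
                    stack.append(v)
    return comp


def is_satisfiable(num_vars, clauses):
    """2-SAT check. Literals are signed ints: v means x_v, -v means NOT x_v."""
    nodes = [lit for v in range(1, num_vars + 1) for lit in (v, -v)]
    graph = {n: [] for n in nodes}
    for a, b in clauses:
        graph[-a].append(b)  # NOT a implies b
        graph[-b].append(a)  # NOT b implies a
    comp = _scc_ids(nodes, graph)
    return all(comp[v] != comp[-v] for v in range(1, num_vars + 1))


def contradiction_cycle(forward_len, backward_len):
    """Hypothetical generator (an assumption, not the paper's):
    x1 -> ... -> NOT x1 through `forward_len` intermediate variables, and
    NOT x1 -> ... -> x1 through `backward_len` intermediate variables."""
    def chain(lits):
        # The implication p -> q is the clause (NOT p OR q).
        return [(-p, q) for p, q in zip(lits, lits[1:])]

    fwd_vars = list(range(2, 2 + forward_len))
    bwd_vars = list(range(2 + forward_len, 2 + forward_len + backward_len))
    clauses = chain([1] + fwd_vars + [-1]) + chain([-1] + bwd_vars + [1])
    return 1 + forward_len + backward_len, clauses


n, clauses = contradiction_cycle(forward_len=2, backward_len=3)
unsat_core_detected = not is_satisfiable(n, clauses)
sat_after_dropping_one = is_satisfiable(n, clauses[:-1])
```

In this sketch the two chains form a minimal contradiction cycle, so removing a single clause (here the last one) restores satisfiability. Unequal values of `forward_len` and `backward_len` give one possible reading of an "imbalanced" core.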

Taxonomy

Topics: Formal Methods in Verification · Constraint Satisfaction and Optimization · Logic, Reasoning, and Knowledge