On the Expressive Power of GNNs for Boolean Satisfiability

Saku Peltonen; Roger Wattenhofer

arXiv:2602.08745·cs.LG·February 10, 2026

On the Expressive Power of GNNs for Boolean Satisfiability

Saku Peltonen, Roger Wattenhofer

PDF

Open Access 3 Reviews

TL;DR

This paper investigates the limitations of Graph Neural Networks in solving Boolean SAT problems, revealing that their expressive power, bounded by the Weisfeiler-Leman test, is insufficient for certain instances, especially industrial ones.

Contribution

It provides a theoretical analysis of GNNs' expressive power for SAT, demonstrating fundamental limitations and exploring the expressivity needed for different SAT instance types.

Findings

01

WL hierarchy cannot distinguish satisfiable from unsatisfiable instances in general.

02

Higher-order WL indistinguishability impacts practical SAT solvers.

03

Industrial instances often require more expressive models to solve.

Abstract

Machine learning approaches to solving Boolean Satisfiability (SAT) aim to replace handcrafted heuristics with learning-based models. Graph Neural Networks have emerged as the main architecture for SAT solving, due to the natural graph representation of Boolean formulas. We analyze the expressive power of GNNs for SAT solving through the lens of the Weisfeiler-Leman (WL) test. As our main result, we prove that the full WL hierarchy cannot, in general, distinguish between satisfiable and unsatisfiable instances. We show that indistinguishability under higher-order WL carries over to practical limitations for WL-bounded solvers that set variables sequentially. We further study the expressivity required for several important families of SAT instances, including regular, random and planar instances. To quantify expressivity needs in practice, we conduct experiments on random instances from…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 4Confidence 5

Strengths

- this paper consolidates many known theoretical results about GNNs and WL test as well as empirical datasets - background knowledge about SAT solving, GNNs, graph isomorphism, and WL tests are carefully illustrated

Weaknesses

- the proposed analysis is not novel, since all reported results are either well-known or immediately implied by previous works. Particularly, Xu et al. (2019) has pointed out that GNNs are bounded by the WL-test, which cannot even solve the graph isomorphism problems that are not as strong as the NP-Complete problems like Boolean satisfiability. - the presented theoretical analysis is largely rephrasing previous known results therefore does not contribute new insights - there are some empirica

Reviewer 02Rating 6Confidence 4

Strengths

The presentation is clear and well-organized with logical progression from simple examples to full results, making complex theoretical concepts accessible while providing complete proofs in appendices. The analysis is nuanced, including both impossibility and positive results (e.g., PlanarSAT separability, random instances). The authors are transparent about the scope of their results, clearly acknowledging that their experiments test only necessary conditions for expressivity and do not demonst

Weaknesses

The paper lacks any trained GNN experiments, making it unclear whether the theoretical limitations observed actually translate into performance gaps in practice. The experiments are purely diagnostic and do not involve any trained GNNs, leaving the practical impact of the theoretical limits untested. Lacks an explanation for why industrial instances require higher expressivity than random ones. The paper does not discuss possible ways to overcome the identified expressivity limits, leaving th

Reviewer 03Rating 6Confidence 3

Strengths

The paper's contribution is mostly theoretical. One of the best aspects is that they actually construct the hard family of instances, which is likely to be useful for further work more than a simple proof of existence. Also, although the paper is about GNNs on paper, most of the results are about the WL kernel's suitability for SAT, which is technically a broader problem. In reality, the paper is about finding the "blind spots" of the WL kernel as a method.

Weaknesses

The empirical study is fairly surface level and misses the mark. We already expect that random instances are easy, contest instances are harder, and hard-by-construction instances are computationally hard. But the main problem is as follows. The power of a GNN is only upper bounded by the power of a WL test. Showing how well the WL test performs on various SAT instances, then, is meaningful only in one direction - if it fails, certainly we don't expect GNNs to succeed, but if it succeeds or perf

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFormal Methods in Verification · Constraint Satisfaction and Optimization · Bayesian Modeling and Causal Inference