Formal Mechanistic Interpretability: Automated Circuit Discovery with Provable Guarantees
Itamar Hadad, Guy Katz, Shahaf Bassan

TL;DR
This paper introduces automated circuit discovery methods with provable guarantees for neural networks, ensuring robustness, alignment, and minimality, supported by theoretical insights and empirical validation on vision models.
Contribution
It presents novel algorithms for circuit discovery that provide formal guarantees, bridging verification techniques with interpretability and uncovering new theoretical connections.
Findings
Circuits with stronger robustness guarantees than existing methods.
Algorithms achieve provable input domain robustness and patching guarantees.
Experimental results demonstrate improved circuit quality on vision models.
Abstract
*Automated circuit discovery* is a central tool in mechanistic interpretability for identifying the internal components of neural networks responsible for specific behaviors. While prior methods have made significant progress, they typically depend on heuristics or approximations and do not offer provable guarantees over continuous input domains for the resulting circuits. In this work, we leverage recent advances in neural network verification to propose a suite of automated algorithms that yield circuits with *provable guarantees*. We focus on three types of guarantees: (1) *input domain robustness*, ensuring the circuit agrees with the model across a continuous input region; (2) *robust patching*, certifying circuit alignment under continuous patching perturbations; and (3) *minimality*, formalizing and capturing a wide array of various notions of succinctness. Interestingly, we…
Peer Reviews
Decision·ICLR 2026 Poster
- The core contribution (applying formal verification to automated circuit discovery) is novel and significant. It directly addresses a fundamental and critical limitation in the field of MI (moving from heuristic approximations to provable guarantees), which has been acknowledged in many recent works in the literature. This work makes a significant step towards increasing the reliability of MI methods and is much-needed. - The paper is technically excellent. The formalizations are clear, and th
- The most significant weakness (which the authors do acknowledge) is the scalability of the neural network verification methods. The experiments are conducted on relatively small vision models (e.g. ResNet2b) yet already require computation time in the minutes to hour range. Current MI research largely focuses on large LLMs or other models using the Transformers architecture, and as proposed, the authors' framework would likely be computationally intractable for such models. This is an inherent
- Problem addressed is impactful and the authors identified clear blind spot in literature - Thorough theoretical contribution - Experiments rely on VNN-COMP community-benchmark. Results show authors method strongly outperforms the chosen baselines, across all experiments. - Paper is clear and flows well. Authors contribution also clear. - Literature review carried out with diligence. I am not a domain expert - fellow reviewers with greater expertise may have identified gaps.
- Some neural network verification concepts could have been introduced more in details (e.g. "Patching") - Lack of running example make reading hard to instantiate into a real-world, impactful use case.
1. Clear definitions and proofs for robustness and minimality in circuit discovery. 2. The greedy and hitting-set approaches are well motivated and practically implementable. 3. Connects mechanistic interpretability with formal verification through provable guarantees.
1. High computational cost: Verification with α–β-CROWN for each candidate circuit is extremely slow; scalability remains a bottleneck. 2. Small experimental scope: Only small convolutional and MLP networks are tested; no evidence on larger or modern architectures. 3. Limited interpretive analysis: The paper emphasizes correctness and robustness, but offers little discussion of what the discovered circuits mean semantically. 4. Evaluation imbalance: Most comparisons are against heuristic base
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Explainable Artificial Intelligence (XAI)
