Certified Circuits: Stability Guarantees for Mechanistic Circuits
Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer

TL;DR
This paper introduces Certified Circuits, a method that guarantees the stability of neural circuit discovery against dataset perturbations, resulting in more reliable and compact interpretability explanations.
Contribution
It provides a formal framework for stable circuit discovery using randomized data subsampling, improving robustness and interpretability of neural network explanations.
Findings
Achieves up to 91% higher accuracy than baseline circuits on ImageNet and out-of-distribution (OOD) datasets.
Uses 45% fewer neurons in the resulting circuits.
Circuits remain reliable where baseline methods fail.
Abstract
Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits: minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept itself or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On…
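The wrapping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's certification procedure: the abstract does not give the bound, so the sketch only shows the general pattern of running a black-box discovery function on random subsamples, voting per component, and abstaining when the vote is not decisive. The function name `stable_circuit`, the parameters `n_rounds`, `subsample_size`, and `margin`, and the toy discovery function are all hypothetical.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence, Set


def stable_circuit(
    dataset: Sequence,
    discover: Callable[[List], Set[Hashable]],
    components: Sequence[Hashable],
    n_rounds: int = 200,
    subsample_size: int = 8,
    margin: float = 0.2,
    seed: int = 0,
) -> Dict[Hashable, str]:
    """Vote on each component over random subsamples of the concept dataset.

    Assumption: a simple vote margin stands in for the paper's certificate,
    which the abstract does not specify.
    """
    rng = random.Random(seed)
    counts = {c: 0 for c in components}
    pool = list(dataset)
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and run the black box.
        subset = rng.sample(pool, subsample_size)
        for c in discover(subset):
            if c in counts:
                counts[c] += 1

    decisions: Dict[Hashable, str] = {}
    for c, k in counts.items():
        freq = k / n_rounds
        if freq >= 0.5 + margin:
            decisions[c] = "include"
        elif freq <= 0.5 - margin:
            decisions[c] = "exclude"
        else:
            # The inclusion decision flips across subsamples: abstain.
            decisions[c] = "abstain"
    return decisions


def toy_discover(subset: List[int]) -> Set[str]:
    """Hypothetical discovery function: n0 is always found, n1 never,
    and n2 depends on an incidental property of the subsample."""
    circuit = {"n0"}
    if sum(subset) % 2 == 0:
        circuit.add("n2")
    return circuit


decisions = stable_circuit(list(range(20)), toy_discover, ["n0", "n1", "n2"])
print(decisions)
```

In this toy setup, the component whose inclusion depends on which examples happen to be sampled is the one expected to be abstained from, which mirrors the abstract's point that dropping unstable neurons yields more compact circuits.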
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning
