CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles
Swapnil Parekh

TL;DR
CIRCUS identifies robust circuit structure by pruning a single attribution graph under multiple configurations, distinguishing edges that survive across configurations from threshold artifacts with minimal computational overhead.
Contribution
It presents a novel ensemble-based approach for circuit discovery that quantifies edge robustness and extracts a consensus circuit, improving interpretability and reliability.
Findings
Consensus circuits are 40x smaller than the union of circuits across all configurations.
They retain explanatory power comparable to that of the larger union.
They outperform influence-ranked and random baselines.
Abstract
Every mechanistic circuit carries an invisible asterisk: it reflects not just the model's computation, but the analyst's choice of pruning threshold. Change that choice and the circuit changes, yet current practice treats a single pruned subgraph as ground truth with no way to distinguish robust structure from threshold artifacts. We introduce CIRCUS, which reframes circuit discovery as a problem of uncertainty over explanations. CIRCUS prunes one attribution graph under B configurations, assigns each edge an empirical inclusion frequency s(e) in [0,1] measuring how robustly it survives across the configuration family, and extracts a consensus circuit of edges present in every view. This yields a principled core/contingent/noise decomposition (analogous to posterior model-inclusion indicators in Bayesian variable selection) that separates robust…
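The inclusion-frequency and consensus steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each of the B pruned views is already available as a set of edges; the `Edge` type and the function names `inclusion_frequency` and `decompose` are hypothetical. The cutoffs used for the three-way split (core at s(e) = 1, contingent at 0 < s(e) < 1, noise at s(e) = 0) are also an assumption, since the abstract states only that the consensus circuit contains the edges present in every view.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # hypothetical edge type: (source node, target node)


def inclusion_frequency(views: List[Set[Edge]]) -> Dict[Edge, float]:
    """Return s(e) = (number of views containing e) / B for each selected edge."""
    if not views:
        raise ValueError("need at least one pruned view")
    counts = Counter(edge for view in views for edge in view)
    return {edge: n / len(views) for edge, n in counts.items()}


def decompose(all_edges: Iterable[Edge], views: List[Set[Edge]]):
    """Split edges into core, contingent, and noise sets.

    Assumed cutoffs (not taken from the paper): core means s(e) = 1,
    contingent means 0 < s(e) < 1, noise means s(e) = 0.
    """
    s = inclusion_frequency(views)
    core, contingent, noise = set(), set(), set()
    for edge in all_edges:
        freq = s.get(edge, 0.0)  # edges never selected have frequency 0
        if freq == 1.0:
            core.add(edge)
        elif freq > 0.0:
            contingent.add(edge)
        else:
            noise.add(edge)
    return core, contingent, noise


# Toy data: B = 3 pruned views of one attribution graph (illustrative only).
views = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
]
# The full graph also holds an edge that no configuration keeps.
all_edges = set().union(*views) | {("a", "d")}

s = inclusion_frequency(views)
core, contingent, noise = decompose(all_edges, views)
```

In this toy example the core set is the consensus circuit: it equals the intersection of the three views, while the union of the views also contains the contingent edges.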
Taxonomy
Topics: Advanced Graph Neural Networks · Adversarial Robustness in Machine Learning · Machine Learning in Materials Science
