TL;DR
This paper improves circuit discovery in mechanistic interpretability with three techniques: bootstrapping, ratio-based edge selection, and an integer linear programming (ILP) formulation. Together they yield more faithful circuits across multiple MIB tasks and models.
Contribution
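It presents three techniques that improve circuit discovery on the Mechanistic Interpretability Benchmark (MIB): bootstrapping to identify edges with consistent attribution scores, ratio-based prioritization of strong positive-scoring edges, and an integer linear programming (ILP) formulation that replaces the standard greedy selection.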
Findings
The proposed methods yield more faithful circuits
They outperform prior approaches across multiple MIB tasks and models
Code is publicly available for reproducibility
Abstract
One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
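The first two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sign-agreement criterion, the `min_agreement` threshold, and the way `positive_ratio` splits the edge budget are all assumptions made for illustration. The sketch resamples examples with replacement, keeps edges whose mean attribution score has a stable sign across resamples, and then fills part of a fixed edge budget with the strongest positive-scoring edges before filling the remainder by absolute score.

```python
import random
from statistics import mean


def bootstrap_consistent_edges(scores, n_resamples=200, min_agreement=0.95, seed=0):
    """Keep edges whose mean score has a stable sign under resampling.

    scores: dict mapping an edge to a list of per-example attribution
    scores (all lists the same length). Hypothetical input format.
    """
    rng = random.Random(seed)
    edges = list(scores)
    n_examples = len(next(iter(scores.values())))
    positive_counts = {edge: 0 for edge in edges}

    for _ in range(n_resamples):
        # Draw example indices with replacement (one bootstrap resample).
        idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        for edge in edges:
            if mean(scores[edge][i] for i in idx) > 0:
                positive_counts[edge] += 1

    kept = []
    for edge in edges:
        frac_positive = positive_counts[edge] / n_resamples
        # Assumed criterion: the sign agrees in most resamples.
        if max(frac_positive, 1 - frac_positive) >= min_agreement:
            kept.append(edge)
    return kept


def ratio_select(mean_scores, budget, positive_ratio=0.5):
    """Fill part of the budget with top positive edges, the rest by |score|.

    positive_ratio is an assumed knob: the fraction of the budget
    reserved for the strongest positive-scoring edges.
    """
    n_positive = int(budget * positive_ratio)
    positives = sorted(
        (e for e, s in mean_scores.items() if s > 0),
        key=lambda e: mean_scores[e],
        reverse=True,
    )[:n_positive]
    chosen = set(positives)
    remaining = sorted(
        (e for e in mean_scores if e not in chosen),
        key=lambda e: abs(mean_scores[e]),
        reverse=True,
    )
    return positives + remaining[: budget - len(positives)]


# Toy data: two edges with a stable sign, one whose sign flips.
toy_scores = {
    ("a", "b"): [0.9, 1.1, 1.0, 0.8],
    ("a", "c"): [-0.7, -0.9, -0.8, -1.0],
    ("b", "c"): [0.5, -0.5, 0.4, -0.4],
}
stable_edges = bootstrap_consistent_edges(toy_scores)
toy_means = {"e1": 0.9, "e2": -1.2, "e3": 0.3, "e4": -0.1, "e5": 0.05}
selected = ratio_select(toy_means, budget=4, positive_ratio=0.5)
```

On the toy data, the edge whose per-example scores flip sign is filtered out by the bootstrap step, while the ratio step reserves half of a four-edge budget for the two strongest positive edges. The third contribution, the ILP formulation, would replace the greedy fill in `ratio_select` with a solver-based selection and is not sketched here.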