Rethinking Circuit Completeness in Language Models: AND, OR, and ADDER Gates
Hang Chen, Jiaying Zhu, Xinyu Yang, Wenya Wang

TL;DR
This paper introduces a framework for more complete and faithful circuit discovery in language models by systematically analyzing AND, OR, and ADDER gates, improving interpretability and revealing fundamental properties of model circuits.
Contribution
It proposes a novel framework combining noising and denoising interventions to accurately identify logic gates, addressing incompleteness issues in existing methods.
Findings
The framework restores the faithfulness and completeness of discovered circuits.
It uncovers the proportions of AND, OR, and ADDER gates in language models and their contributions.
It improves the interpretability of model mechanisms.
Abstract
Circuit discovery has gradually become one of the prominent methods for mechanistic interpretability, and research on circuit completeness has also garnered increasing attention. Methods of circuit discovery that do not guarantee completeness not only result in circuits that are not fixed across different runs but also cause key mechanisms to be omitted. The nature of incompleteness arises from the presence of OR gates within the circuit, which are often only partially detected in standard circuit discovery methods. To this end, we systematically introduce three types of logic gates: AND, OR, and ADDER gates, and decompose the circuit into combinations of these logical gates. Through the concept of these gates, we derive the minimum requirements necessary to achieve faithfulness and completeness. Furthermore, we propose a framework that combines noising-based and denoising-based…
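To make the gate distinction concrete, here is a minimal toy sketch in Python. It is not the paper's framework. It assumes a two-input gate whose inputs take the value 1.0 on a clean run and 0.0 on a corrupted run, and it models AND as `min`, OR as `max`, and ADDER as an equally weighted sum. These definitions, and all names in the code, are assumptions made for illustration only. Under them, corrupting a single input of a clean run (noising) exposes an AND input but not an OR input, while restoring a single input of a corrupted run (denoising) does the reverse, which is one way to see why a single intervention type can miss gates.

```python
from typing import Callable, Dict, Tuple

# Toy activation values (assumption): 1.0 on the clean run, 0.0 on the corrupted run.
CLEAN, CORRUPT = 1.0, 0.0

Gate = Callable[[float, float], float]

# Toy gate definitions (assumptions for illustration, not the paper's formal ones).
GATES: Dict[str, Gate] = {
    "AND": lambda a, b: min(a, b),        # needs both inputs
    "OR": lambda a, b: max(a, b),         # either input suffices
    "ADDER": lambda a, b: 0.5 * (a + b),  # inputs contribute independently
}


def noising_effect(gate: Gate) -> float:
    """Output drop when input A alone is corrupted in an otherwise clean run."""
    return gate(CLEAN, CLEAN) - gate(CORRUPT, CLEAN)


def denoising_effect(gate: Gate) -> float:
    """Output gain when input A alone is restored in an otherwise corrupted run."""
    return gate(CLEAN, CORRUPT) - gate(CORRUPT, CORRUPT)


def effects() -> Dict[str, Tuple[float, float]]:
    """Map each gate name to its (noising effect, denoising effect) for input A."""
    return {
        name: (noising_effect(gate), denoising_effect(gate))
        for name, gate in GATES.items()
    }


if __name__ == "__main__":
    for name, (noise, denoise) in effects().items():
        print(f"{name:5s} noising={noise:.1f} denoising={denoise:.1f}")
```

In this toy setting the OR input has zero noising effect and the AND input has zero denoising effect, while the ADDER input shows up under both. Combining the two measurements therefore separates the three gate types here; whether and how this carries over to real model circuits is what the paper itself addresses.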
Taxonomy
Topics: Natural Language Processing Techniques
