From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour, Shichang Zhang

TL;DR
This paper investigates how GPT-2 small handles binary truth values in syllogistic prompts, identifying circuits and binary mechanisms that explain its behavior, including its ability to produce a negated token that does not appear in the input.
Contribution
The paper identifies attention-head circuits responsible for syllogistic reasoning in GPT-2 small and reports that these circuits explain the model's behavior on the tasks with high faithfulness.
Findings
Identified circuits with five attention heads explaining over 90% of performance
Discovered binary mechanisms enabling negation through negative heads
Linked syllogistic reasoning to prior IOI analysis
Abstract
Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning than IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input…
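To make the prompt format concrete, the following is a minimal sketch of how syllogistic prompts like the one quoted in the abstract could be generated together with their expected completions. Only the "matches" template comes from the abstract. The "is the opposite of" wording, the function name `build_examples`, and the `NEGATE` table are hypothetical illustrations of a negated variant and are not taken from the paper.

```python
# Template copied from the example prompt quoted in the abstract.
MATCH_TEMPLATE = (
    "Statement {a} is {value}. Statement {b} matches statement {a}. Statement {b} is"
)

# ASSUMPTION: hypothetical negated variant. The phrase "is the opposite of"
# is invented for illustration and may differ from the paper's prompts.
OPPOSITE_TEMPLATE = (
    "Statement {a} is {value}. Statement {b} is the opposite of statement {a}. "
    "Statement {b} is"
)

# Maps each truth value to its negation.
NEGATE = {"true": "false", "false": "true"}


def build_examples(a="A", b="B"):
    """Return (prompt, expected next word) pairs for both templates and truth values."""
    examples = []
    for value in ("true", "false"):
        # Matching case: the answer repeats a word that appears in the prompt.
        examples.append((MATCH_TEMPLATE.format(a=a, b=b, value=value), value))
        # Negated case: the answer is a word that never appears in the prompt.
        examples.append(
            (OPPOSITE_TEMPLATE.format(a=a, b=b, value=value), NEGATE[value])
        )
    return examples


if __name__ == "__main__":
    for prompt, answer in build_examples():
        print(f"{prompt!r} -> {answer!r}")
```

In the negated case the expected word is absent from the prompt, which is the situation the abstract describes as producing "a negated token not present in the input".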
