Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models
Ryoma Kumon, Hitomi Yanaka

TL;DR
This paper asks whether language models share neural mechanisms across different syntactic constructions. It finds a localized, shared mechanism for filler-gap dependencies, no unified mechanism for NPI licensing, and that mechanisms identified by activation patching generalize to out-of-distribution data.
Contribution
It introduces a granular causal interpretability approach to identify shared syntactic mechanisms in language models, validated through manipulation experiments.
Findings
Shared filler-gap mechanisms are localized in early to middle layers.
NPI processing does not exhibit a unified shared mechanism.
Identified mechanisms generalize to out-of-distribution data.
Abstract
While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that these mechanisms identified by activation patching generalize to out-of-distribution data, while distributed…
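For readers unfamiliar with activation patching, the general idea can be sketched on a toy example. This is not the authors' code or setup: the "model" below is a hypothetical sum of three independent components (`head_0`, `head_1`, `mlp_0` are made-up names), whereas a real transformer has components that read each other's outputs. The sketch only illustrates the core step: run the model on a corrupted input, overwrite one component's activation with its value from the clean run, and measure how much of the clean-versus-corrupted gap in the output metric is restored.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy "model": every named component reads the input and adds a
# scalar to the output metric. Real models are not additive like this.
Component = Callable[[List[float]], float]

COMPONENTS: Dict[str, Component] = {
    "head_0": lambda x: 2.0 * x[0],  # depends only on the first feature
    "head_1": lambda x: 0.5 * x[1],  # depends only on the second feature
    "mlp_0": lambda x: 1.0,          # constant; ignores the input
}


def run(
    x: List[float], patch: Optional[Dict[str, float]] = None
) -> Tuple[float, Dict[str, float]]:
    """Run the toy model; `patch` forces the named activations to given values."""
    patch = patch or {}
    acts = {
        name: patch[name] if name in patch else fn(x)
        for name, fn in COMPONENTS.items()
    }
    return sum(acts.values()), acts


def patching_effects(clean: List[float], corrupted: List[float]) -> Dict[str, float]:
    """Fraction of the clean-vs-corrupted metric gap restored per component."""
    clean_metric, clean_acts = run(clean)
    corrupt_metric, _ = run(corrupted)
    gap = clean_metric - corrupt_metric
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same metric")
    effects = {}
    for name in COMPONENTS:
        # Corrupted run, with one activation replaced by its clean value.
        patched_metric, _ = run(corrupted, patch={name: clean_acts[name]})
        effects[name] = (patched_metric - corrupt_metric) / gap
    return effects


# The two inputs differ only in the first feature, so only head_0 matters.
print(patching_effects(clean=[1.0, 1.0], corrupted=[0.0, 1.0]))
# prints {'head_0': 1.0, 'head_1': 0.0, 'mlp_0': 0.0}
```

In this toy case, a score near 1 marks a component that carries the information distinguishing the clean input from the corrupted one, and a score near 0 marks one that does not. Comparing such scores across constructions is one way to ask whether the same components are involved, under the simplifying assumptions stated above.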
