Feature Interactions Reveal Linguistic Structure in Language Models
Jaap Jumelet, Willem Zuidema

TL;DR
This paper investigates feature interactions in language models to understand their role in capturing linguistic structure, evaluating various attribution methods through formal language tasks and real-world case studies.
Contribution
It introduces a grey box methodology using PCFGs to assess feature interaction attribution methods and demonstrates their effectiveness in revealing linguistic structures in language models.
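As a rough illustration of the formal-language setup, the sketch below samples strings from a probabilistic context-free grammar. The grammar `TOY_PCFG`, the helper names `sample` and `is_well_formed`, and the rule probabilities are all hypothetical and chosen for brevity; they are not the grammars or code used in the paper. The point is only that when the generating rules are known, it is also known which tokens depend on each other, which is what makes it possible to check an attribution method against the grammar.

```python
import random

# Hypothetical toy PCFG (an assumption for illustration, not the paper's grammar).
# Each nonterminal maps to a list of (right-hand side, probability) pairs.
TOY_PCFG = {
    "S": [(("a", "S", "b"), 0.6), (("a", "b"), 0.4)],
}


def sample(symbol, grammar, rng):
    """Recursively expand `symbol` into a list of terminal tokens."""
    if symbol not in grammar:  # anything without a rule is a terminal
        return [symbol]
    rules = grammar[symbol]
    options = [rhs for rhs, _ in rules]
    weights = [prob for _, prob in rules]
    rhs = rng.choices(options, weights=weights, k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample(child, grammar, rng))
    return tokens


def is_well_formed(tokens):
    """Check membership in the language a^n b^n (n >= 1) defined by TOY_PCFG."""
    n = len(tokens) // 2
    return len(tokens) >= 2 and tokens == ["a"] * n + ["b"] * n


rng = random.Random(0)  # fixed seed so the sample is reproducible
strings = [sample("S", TOY_PCFG, rng) for _ in range(5)]
for s in strings:
    print(" ".join(s), is_well_formed(s))
```

In this toy language every `a` is licensed by a matching `b`, so a faithful interaction attribution for a model trained on it would be expected to link those pairs.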
Findings
Some attribution methods can uncover grammatical rules in formal language tasks.
Evaluation on language models provides new insights into their acquired linguistic structures.
Certain configurations improve the faithfulness of interaction attributions.
Abstract
We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations for its input, and might provide an ideal starting point for the investigation into linguistic structure in language models. However, uncovering the exact role that these interactions play is also difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We work out a grey box methodology, in which we train models to perfection on a formal language…
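To make the notion of a feature interaction concrete, here is a minimal sketch of a pairwise interaction score computed as a second-order difference against a baseline input. This is a generic textbook-style definition used for illustration only; it is not claimed to be one of the attribution methods evaluated in the paper, and `toy_model`, `mask`, and `pairwise_interaction` are hypothetical names.

```python
def mask(x, keep, baseline):
    """Keep the features whose index is in `keep`; replace the rest by the baseline."""
    return [xi if i in keep else bi for i, (xi, bi) in enumerate(zip(x, baseline))]


def pairwise_interaction(f, x, baseline, i, j):
    """Second-order difference: the joint effect of i and j minus their separate effects."""
    return (
        f(mask(x, {i, j}, baseline))
        - f(mask(x, {i}, baseline))
        - f(mask(x, {j}, baseline))
        + f(mask(x, set(), baseline))
    )


def toy_model(x):
    """Hypothetical model: features 0 and 1 interact multiplicatively, feature 2 is additive."""
    return x[0] * x[1] + x[2]


x = [2.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]  # assumed all-zero baseline

inter_01 = pairwise_interaction(toy_model, x, baseline, 0, 1)
inter_02 = pairwise_interaction(toy_model, x, baseline, 0, 2)
print(inter_01, inter_02)
```

With these inputs the score is nonzero for the multiplicative pair (0, 1) and zero for the purely additive pair (0, 2). The result depends on the choice of baseline, which is one kind of configuration choice that can affect how faithful an attribution is.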
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
