Patches of Nonlinearity: Instruction Vectors in Large Language Models
Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych

TL;DR
This paper investigates how instruction-specific representations, called Instruction Vectors, are formed and used in large language models, revealing their localized nature and complex non-linear interactions across model layers.
Contribution
It introduces a novel method to localize information processing in language models and uncovers the dual linear and non-linear properties of instruction representations.
Findings
Instruction representations, termed Instruction Vectors (IVs), are fairly localized within models.
IVs exhibit linear separability and non-linear causal interactions.
Different information pathways are activated in later layers based on early task representations.
Abstract
Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information…
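As a rough illustration of the causal mediation idea mentioned in the abstract, the sketch below runs activation patching on a toy two-layer network. It is not the paper's method or code. The network, its random weights, and the names `forward` and `patching_effects` are all hypothetical, chosen only to show how restoring one "clean" hidden activation in a "corrupted" run measures that unit's causal contribution.

```python
import math
import random

def forward(x, W1, w2, patch=None):
    """Toy network (an assumption for illustration): hidden = tanh(W1 x), output = w2 . hidden.

    patch maps a hidden-unit index to the value that overwrites that unit.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for i, value in patch.items():
            hidden[i] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden

def patching_effects(x_clean, x_corrupt, W1, w2):
    """For each hidden unit, restore its clean activation in the corrupted run
    and record how far the output moves away from the corrupted output."""
    out_clean, h_clean = forward(x_clean, W1, w2)
    out_corrupt, _ = forward(x_corrupt, W1, w2)
    effects = []
    for i in range(len(W1)):
        out_patched, _ = forward(x_corrupt, W1, w2, patch={i: h_clean[i]})
        effects.append(out_patched - out_corrupt)
    return effects, out_clean, out_corrupt

# Hypothetical random weights and inputs; nothing here comes from a real model.
random.seed(0)
W1 = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
w2 = [random.gauss(0, 1) for _ in range(8)]
x_clean = [random.gauss(0, 1) for _ in range(4)]
x_corrupt = [random.gauss(0, 1) for _ in range(4)]

effects, out_clean, out_corrupt = patching_effects(x_clean, x_corrupt, W1, w2)
most_causal = max(range(len(effects)), key=lambda i: abs(effects[i]))
print("unit with largest patching effect:", most_causal)
```

In this toy the readout is linear, so the per-unit effects add up exactly to the gap between the clean and corrupted outputs. Once further non-linear layers sit between the patched site and the output, single-unit effects no longer need to sum this way, which is the kind of interaction that purely linear analyses can miss.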
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
