Linear representations in language models can change dramatically over a conversation
Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini, Sangnie Bhardwaj, Murray Shanahan

TL;DR
This paper investigates how linear representations in language models evolve over the course of a conversation. It finds large, content-dependent changes, which challenge interpretability methods that treat a representation's meaning as static and suggest that models adapt their representations dynamically.
Contribution
The paper demonstrates that linear representations in language models can change dramatically over a conversation, highlighting the need for interpretability work to account for how representations evolve across the context.
Findings
Information represented as factual at the beginning of a conversation can be represented as non-factual by the end, and vice versa.
Representation changes are content-dependent and occur across different models and layers.
Replaying conversations or using different models can produce similar representational dynamics.
Abstract
Language model representations often contain linear directions that correspond to high-level concepts. Here, we study the dynamics of these representations: how representations evolve along these dimensions within the context of (simulated) conversations. We find that linear representations can change dramatically over a conversation; for example, information that is represented as factual at the beginning of a conversation can be represented as non-factual at the end and vice versa. These changes are content-dependent; while representations of conversation-relevant information may change, generic information is generally preserved. These changes are robust even for dimensions that disentangle factuality from more superficial response patterns, and occur across different model families and layers of the model. These representation changes do not require on-policy conversations; even…
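As a rough illustration of what tracking a representation along a linear direction can look like, the sketch below fits a difference-of-means direction between two labeled sets of vectors and then projects a sequence of per-turn vectors onto it. Everything here is an assumption made for illustration: the vectors are synthetic stand-ins for hidden states, the helper names (`concept_direction`, `project`, `fake_activation`) are hypothetical, the drift across turns is hard-coded rather than measured, and difference-of-means is only one common way to obtain a linear direction, not necessarily the one used in the paper.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_direction(pos, neg):
    """Unit-norm difference-of-means direction between two labeled sets."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(vec, direction):
    """Scalar projection of a vector onto a unit-norm direction."""
    return sum(v * d for v, d in zip(vec, direction))

# Synthetic stand-ins for hidden states (assumption: no real model is used).
rng = random.Random(0)
DIM = 16

def fake_activation(offset):
    # Hypothetical generator: small Gaussian noise, with the "concept"
    # placed on coordinate 0 via the given offset.
    return [rng.gauss(0.0, 0.1) + (offset if i == 0 else 0.0)
            for i in range(DIM)]

factual = [fake_activation(+1.0) for _ in range(50)]
nonfactual = [fake_activation(-1.0) for _ in range(50)]
direction = concept_direction(factual, nonfactual)

# One statement's synthetic activation at successive conversation turns.
# The drift from +1 to -1 is hard-coded for illustration, not a result.
offsets = [1.0, 0.5, 0.0, -0.5, -1.0]
trajectory = [project(fake_activation(o), direction) for o in offsets]

for turn, score in enumerate(trajectory):
    print(f"turn {turn}: projection = {score:.3f}")
```

With real activations, `fake_activation` would be replaced by hidden states extracted from a chosen layer at each turn; the sign and magnitude of the projection are then read as how strongly the statement is represented along the fitted direction at that point in the context.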
Taxonomy
Topics: Language and cultural evolution · Opinion Dynamics and Social Influence · Topic Modeling
