Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West

TL;DR
This paper asks whether multilingual language models develop universal, language-independent concept representations. It analyzes and manipulates latent activations during word-translation tasks and finds evidence of language-agnostic concepts.
Contribution
The study applies activation patching to show that language models encode the output language and the concept separately, which supports the existence of universal concept representations.
Findings
The output language is encoded in the latent at an earlier layer than the concept to be translated.
Concept and language can be independently altered via activation patching.
Mean concept representations improve translation and enable natural language descriptions.
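The "mean concept representation" in the last finding can be pictured as an element-wise average of the latents obtained for the same concept from prompts in several source languages. The sketch below is only an illustration: the function name `mean_latent`, the language codes, and the toy vectors are assumptions, not taken from the paper, and real latents would be high-dimensional activations read from a model.

```python
from typing import Dict, List


def mean_latent(latents_by_language: Dict[str, List[float]]) -> List[float]:
    """Average one concept's latent vectors element-wise across languages.

    Hypothetical helper: keys are source-language codes, values are the
    latents extracted for the same concept from prompts in that language.
    """
    if not latents_by_language:
        raise ValueError("need at least one latent vector")
    vectors = list(latents_by_language.values())
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all latents must have the same dimension")
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


# Toy 2-dimensional latents for one concept (made-up numbers).
concept_latents = {
    "de": [1.0, 2.0],
    "fr": [3.0, 4.0],
    "it": [5.0, 6.0],
}
averaged = mean_latent(concept_latents)
```

The averaged vector would then be the one patched into the target prompt's forward pass in place of a single-language latent.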
Abstract
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word-translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean representation of a concept across different languages does not affect the models'…
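The extract-and-insert procedure described in the abstract can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-layer causal "model", its weights in `LAYERS`, and the names `toy_layer` and `forward` are all hypothetical, and a real experiment would hook the residual stream of a transformer instead.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
Sequence = List[Vector]  # one latent vector per token position

# Hypothetical (self_weight, context_weight) pair for each toy layer.
LAYERS = [(0.9, 0.4), (0.7, 0.6), (1.1, 0.3)]


def toy_layer(seq: Sequence, self_w: float, ctx_w: float) -> Sequence:
    """Causal toy layer: position i mixes itself with the mean of positions 0..i."""
    out = []
    for i in range(len(seq)):
        dim = len(seq[i])
        ctx = [sum(seq[j][d] for j in range(i + 1)) / (i + 1) for d in range(dim)]
        out.append([math.tanh(self_w * seq[i][d] + ctx_w * ctx[d]) for d in range(dim)])
    return out


def forward(seq: Sequence, patch: Optional[Tuple[int, Vector]] = None) -> List[Sequence]:
    """Run the toy model and return the latents after every layer.

    If patch = (layer_index, vector) is given, the latent at the LAST
    position is overwritten with `vector` right after that layer, so all
    later layers compute on the patched value.
    """
    latents = []
    h = [list(v) for v in seq]
    for idx, (self_w, ctx_w) in enumerate(LAYERS):
        h = toy_layer(h, self_w, ctx_w)
        if patch is not None and patch[0] == idx:
            h[-1] = list(patch[1])
        latents.append([list(v) for v in h])
    return latents


# Made-up embeddings standing in for a source and a target translation prompt.
source_prompt = [[0.5, -0.2], [0.1, 0.9]]
target_prompt = [[-0.3, 0.4], [0.8, -0.6]]

source_latents = forward(source_prompt)   # 1. record latents on the source prompt
clean_latents = forward(target_prompt)    # 2. unpatched reference run on the target
patched_latents = forward(                # 3. insert the source latent at layer 1
    target_prompt, patch=(1, source_latents[1][-1])
)
```

Sweeping the patched layer index and checking at which layer the output changes is the kind of comparison that could reveal where different pieces of information are encoded. In this toy, layers before the patch match the clean run, while the final latent differs from both the clean run and the source run because the earlier position still carries the target prompt's context.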
Taxonomy
Topics: Natural Language Processing Techniques · Neurobiology of Language and Bilingualism · Speech and dialogue systems
Methods: Activation Patching
