TL;DR
This paper investigates how large language models encode morphosyntactic concepts across diverse languages, revealing shared representations and demonstrating their causal role in multilingual tasks like translation.
Contribution
It introduces a method, based on sparse autoencoders (SAEs) trained on Llama-3-8B and Aya-23-8B, to identify and manipulate grammatical features shared across languages, and shows that these features are robust and consistent across languages.
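One simple way to picture "shared" features: score each SAE feature per language for a given concept (for example, how much more strongly it fires on plural than on singular nouns), then keep the features that rank highly in many languages. The sketch below does this in plain Python. The scoring scheme, the helper names `top_k_features` and `shared_features`, the parameters `k` and `min_langs`, and the toy scores are all hypothetical illustrations, not the paper's selection procedure.

```python
from collections import Counter


def top_k_features(scores: dict[int, float], k: int) -> set[int]:
    """Return the ids of the k features with the largest scores."""
    ranked = sorted(scores, key=lambda f: scores[f], reverse=True)
    return set(ranked[:k])


def shared_features(
    scores_by_lang: dict[str, dict[int, float]], k: int, min_langs: int
) -> set[int]:
    """Return features that rank in the top k for at least `min_langs` languages."""
    counts = Counter()
    for scores in scores_by_lang.values():
        counts.update(top_k_features(scores, k))
    return {f for f, n in counts.items() if n >= min_langs}


# Hypothetical per-feature scores for one concept (e.g. plural vs. singular).
scores = {
    "en": {0: 0.9, 1: 0.1, 2: 0.7, 3: 0.0},
    "es": {0: 0.8, 1: 0.6, 2: 0.05, 3: 0.1},
    "fr": {0: 0.85, 1: 0.2, 2: 0.6, 3: 0.3},
}
print(shared_features(scores, k=2, min_langs=3))  # {0}
```

Lowering `min_langs` trades strictness for coverage: with `min_langs=2`, feature 2 (top-ranked in English and French but not Spanish) is also kept.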
Findings
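Abstract grammatical concepts such as number, gender, and tense are often encoded in feature directions shared across many languages.
Ablating only the multilingual features reduces classifier performance to near-chance across languages.
Modifying these features precisely alters model behavior in a machine translation task.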
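With an SAE, both ablation and steering can be written as edits to a feature's decoded contribution, assuming the model activation equals the SAE reconstruction plus an error term that is left untouched. The sketch below is a minimal plain-Python version under that assumption; `ablate_features`, `steer_feature`, and the toy vectors are hypothetical, and the paper's exact intervention may differ.

```python
def ablate_features(x, acts, feature_dirs, feature_ids):
    """Remove selected SAE features' contributions from activation x.

    x: model activation vector (list of floats)
    acts: SAE feature activations computed from x
    feature_dirs: feature_dirs[i] is feature i's decoder direction
    Assumes x = sum_i acts[i] * feature_dirs[i] + bias + error, and keeps
    the bias and error terms unchanged.
    """
    out = list(x)
    for i in feature_ids:
        for j, d in enumerate(feature_dirs[i]):
            out[j] -= acts[i] * d  # subtract feature i's decoded term
    return out


def steer_feature(x, acts, feature_dirs, i, target):
    """Shift x so that feature i contributes `target` instead of acts[i]."""
    delta = target - acts[i]
    return [xj + delta * dj for xj, dj in zip(x, feature_dirs[i])]


# Toy example: feature 0 might stand for "plural", feature 1 for "past tense".
x = [1.0, 2.0, 0.5]
acts = [0.5, 2.0]
feature_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

print(ablate_features(x, acts, feature_dirs, [1]))  # [1.0, 0.0, -1.5]
print(steer_feature(x, acts, feature_dirs, 0, 3.0))  # [3.5, 2.0, 0.5]
```

Editing through decoder directions rather than raw neurons means an intervention on one concept touches only the subspace the SAE attributes to that feature, which is what makes "ablate only the multilingual features" a well-defined operation.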
Abstract
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this…
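For readers unfamiliar with sparse autoencoders, the sketch below shows a standard ReLU SAE forward pass and training objective: encode an activation into non-negative feature activations, decode a reconstruction, and penalize squared reconstruction error plus the L1 norm of the features. The ReLU architecture, the L1 penalty, and all names here are assumptions for illustration; the abstract does not specify which SAE variant was trained on Llama-3-8B or Aya-23-8B.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into sparse features f, then decode a reconstruction x_hat.

    w_enc has n_features rows of length d_model;
    w_dec has d_model rows of length n_features.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]  # subtract decoder bias
    f = relu([p + b for p, b in zip(matvec(w_enc, centered), b_enc)])
    x_hat = [r + b for r, b in zip(matvec(w_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty on f."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(t) for t in f)


# Toy SAE: d_model = 2, n_features = 3 (an overcomplete dictionary).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]

x = [2.0, 1.0]
f, x_hat = sae_forward(x, w_enc, b_enc, w_dec, b_dec)
print(f, x_hat)  # [2.0, 1.0, 0.0] [2.0, 1.0]
print(sae_loss(x, f, x_hat, l1_coeff=0.1))
```

The L1 term pushes most feature activations to zero, so each activation is explained by a few dictionary directions; those directions are the "features" whose cross-lingual sharing the paper studies.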
