Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Ercong Nie, Helmut Schmid, Hinrich Schütze

TL;DR
This paper investigates the internal mechanisms behind language confusion in English-centric large language models and proposes neuron-level interventions to reduce unintended language switches, improving multilingual performance.
Contribution
It introduces the first mechanistic interpretability study of language confusion, combining behavioral benchmarks with neuron analysis, and demonstrates effective neuron editing for mitigation.
Findings
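Confusion points (CPs), the specific positions where a language switch occurs, are central to language confusion.
Layer-wise analysis with TunedLens reveals that transition failures in the final layers drive confusion.
Editing a small set of critical neurons substantially reduces language confusion while largely preserving general competence and fluency.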
Abstract
Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion while largely preserving general competence and fluency. Our approach…
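To make the notion of a confusion point concrete, here is a minimal sketch that finds the first position in a tokenized response where the writing script departs from the expected one. It is only an illustration under strong assumptions: the helper names `script_of` and `find_confusion_point` are hypothetical, and script detection via Unicode character names is a crude stand-in for real language identification. It cannot separate languages that share a script, and it is not the procedure used by the paper or by the LCB.

```python
import unicodedata


def script_of(token: str) -> str:
    """Return a coarse script label for a token (hypothetical helper).

    Uses the first word of the Unicode name of the first alphabetic
    character, e.g. "LATIN" or "CYRILLIC". Tokens without alphabetic
    characters (digits, punctuation) are labeled "NEUTRAL".
    """
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "NEUTRAL"


def find_confusion_point(tokens, expected_script):
    """Return the index of the first token whose script differs from
    `expected_script`, or None if no switch is found.

    Assumption: a script change is used as a proxy for a language switch.
    """
    for i, tok in enumerate(tokens):
        label = script_of(tok)
        if label != "NEUTRAL" and label != expected_script:
            return i
    return None


if __name__ == "__main__":
    # A Russian response that drifts into English at the fourth token.
    response = ["Привет", ",", "это", "test", "."]
    print(find_confusion_point(response, "CYRILLIC"))
```

A realistic pipeline would replace `script_of` with a proper language identifier and operate on model tokens rather than whitespace-split words.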
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
