LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?
J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang, Oleg Poliannikov

TL;DR
This paper investigates how different layers of large multilingual language models contribute to language control and proposes a selective fine-tuning method that significantly improves language consistency with minimal parameter updates.
Contribution
It introduces a novel interpretability analysis revealing the internal structure of LLMs and presents a layer-specific fine-tuning approach for efficient multilingual adaptation.
Findings
Achieves over 98% language consistency with minimal fine-tuning
Identifies three internal processing phases in LLMs
Layer-specific tuning matches full fine-tuning performance
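To make the layer-specific tuning idea concrete, here is a minimal sketch of how one might pick the parameters to update and measure what fraction of the model they represent. It is not the authors' implementation: the `layers.<index>.` naming pattern, the helper `select_trainable`, and the toy parameter sizes are all assumptions made for illustration.

```python
import re

# Assumed naming convention: per-layer parameters contain "layers.<index>."
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def select_trainable(param_sizes, first_tuned_layer):
    """Return (names to train, fraction of all parameters they cover).

    param_sizes: dict mapping parameter name -> number of elements.
    first_tuned_layer: hypothetical boundary; layers at or above it are tuned,
    everything else (embeddings, earlier layers, output head) stays frozen.
    """
    trainable = []
    for name in param_sizes:
        match = LAYER_RE.search(name)
        if match and int(match.group(1)) >= first_tuned_layer:
            trainable.append(name)
    total = sum(param_sizes.values())
    tuned = sum(param_sizes[name] for name in trainable)
    return trainable, tuned / total


# Toy model with four layers (sizes are made up, not taken from the paper).
toy_params = {
    "embed.weight": 100,
    "layers.0.mlp.weight": 50,
    "layers.1.mlp.weight": 50,
    "layers.2.mlp.weight": 50,
    "layers.3.mlp.weight": 50,
    "lm_head.weight": 100,
}
names, fraction = select_trainable(toy_params, first_tuned_layer=3)
```

In a PyTorch setting, the returned names could be used to set `requires_grad` on each entry of `model.named_parameters()`, so that only the selected late layers receive gradient updates.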
Abstract
Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific…
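The two probes named in the abstract can be sketched in a few lines. This is an illustrative sketch, not the authors' code. It assumes the per-layer vocabulary logits have already been obtained by projecting each layer's hidden state through the output head (the logit lens step), and that a hypothetical `token_lang` list assigns every vocabulary id a language tag.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def language_probs_by_layer(layer_logits, token_lang):
    """Sum next-token probability mass per language at every layer.

    layer_logits: one list of vocabulary logits per layer (assumed precomputed).
    token_lang: language tag for each vocabulary id (hypothetical mapping).
    """
    per_layer = []
    for logits in layer_logits:
        mass = {}
        for prob, lang in zip(softmax(logits), token_lang):
            mass[lang] = mass.get(lang, 0.0) + prob
        per_layer.append(mass)
    return per_layer


def mean_pool(token_states):
    """Average the token vectors of one sentence into a single vector."""
    count = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / count for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy example: a two-token vocabulary whose preferred language flips by layer.
profile = language_probs_by_layer([[10.0, 0.0], [0.0, 10.0]], ["en", "zh"])

# Toy hidden states for the same sentence in two languages at one layer.
similarity = cosine(mean_pool([[1.0, 0.0], [0.0, 1.0]]), mean_pool([[2.0, 2.0]]))
```

Applied to every layer, the first function gives a language-probability profile across depth, and the second gives a cross-lingual similarity curve for parallel inputs.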
Peer Reviews
Decision: ICLR 2026 Poster
- Novel diagnostic framework for multilingual failure modes: The four-scenario prompting setup provides a well-structured and reproducible way to disentangle language control from task accuracy.
- Insightful interpretability analysis: The paper convincingly demonstrates a three-phase structure across layers, linking representational alignment to functional behavior in multilingual settings.
- Strong empirical improvements with minimal compute cost: Selective fine-tuning significantly enhances
- Limited novelty in fine-tuning method: While the interpretability analysis is insightful, the proposed selective tuning strategy builds on well-established parameter-efficient fine-tuning concepts and is not fundamentally new.
- Narrow evaluation scope: The study focuses on only two models (Qwen-3-32B and BLOOM-7.1B) and a limited set of languages. Broader coverage across typologically diverse languages or other architectures would strengthen generalization claims.
- No comparison with alter
The paper makes a strong interpretability-driven contribution, linking layer-wise representational dynamics to achieve multilingual control. The integration of logit lens analysis with hidden-state similarity profiling provides a compelling explanation for language drift. The selective fine-tuning strategy seems intuitive and computationally efficient, and demonstrates that language-specific control can be restored without retraining the full model.
While the results are compelling, several aspects of the methodology need further clarification. The precise criterion for identifying layer boundaries (e.g., layer 55 for Qwen-3-32B) is not fully justified. This raises uncertainty about whether these thresholds are architecture-specific or emergent from model dynamics. The mean-pooled cosine similarity metric may obscure finer token-level divergences, leaving open how exactly semantic alignment transitions into language control. Similarly, Blo
- Improving performance and control of multilingual LLMs is an important topic, especially for ensuring that all users of models, regardless of language, have an equivalent experience.
- The approach offers a method that avoids compute-expensive fine-tuning. The results suggest that only 5% or fewer parameters need to be tuned.
- The paper considers only two models from different families, and these are not of comparable size. Specifically, the Qwen model has 32B parameters whilst the Bloom model has 7B.
- There are statements making assumptions about architectures (e.g., architectures like Qwen favor task success), but it is difficult to know whether these statements are generally true given the open question about model size differences. Also, if numbers are reported from a single run then again we do not know if the observati
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
