Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models
Hongchuan Zeng, Senyu Han, Lu Chen, Kai Yu

TL;DR
This paper investigates how multilingual large language models develop internal linguistic regions and semantic alignment, revealing a common semantic space that improves cross-lingual understanding as models grow larger and train longer.
Contribution
The paper identifies a shared semantic latent space in multilingual LLMs and traces how linguistic regions evolve with training and model scale, changes it links to stronger cross-lingual capabilities.
Findings
Neuron activation patterns are similar across inputs in the same language.
Semantic similarity leads to similar activation patterns across languages.
Linguistic neurons are concentrated in early and late layers, becoming denser with training.
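The first two findings compare neuron activation patterns across inputs. This summary does not state which similarity measure the authors use, so the following is only a minimal sketch under assumptions: activations are binarized with a hypothetical threshold, patterns are compared by cosine similarity, and the toy activation values and the names `activation_pattern` and `cosine_similarity` are invented for illustration, not taken from the paper.

```python
from math import sqrt


def activation_pattern(activations, threshold=0.0):
    """Binarize per-neuron activations: 1 if a neuron exceeds the threshold.

    The threshold is an assumption made for this sketch.
    """
    return [1 if a > threshold else 0 for a in activations]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up activations for six neurons; in practice these would be
# recorded from a model's layers while it processes each sentence.
toy_activations = {
    "en_cat":   [0.9, 0.0, 0.7, 0.0, 0.3, 0.0],  # English, sentence A
    "fr_cat":   [0.8, 0.0, 0.6, 0.0, 0.0, 0.4],  # French, translation of A
    "en_stock": [0.0, 0.9, 0.0, 0.5, 0.3, 0.0],  # English, unrelated sentence B
}

patterns = {name: activation_pattern(a) for name, a in toy_activations.items()}

# Same meaning, different languages
same_meaning = cosine_similarity(patterns["en_cat"], patterns["fr_cat"])
# Same language, different meaning
diff_meaning = cosine_similarity(patterns["en_cat"], patterns["en_stock"])

print(f"same meaning, different language: {same_meaning:.3f}")
print(f"same language, different meaning: {diff_meaning:.3f}")
```

In this toy data, neurons 0 and 2 fire for both translations of sentence A, while neuron 4 fires for both English sentences, so the sketch shows semantic and language-specific overlap side by side. With real models, the same comparison would be run per layer to see where such overlap concentrates.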
Abstract
Large language models (LLMs) have demonstrated remarkable performance, particularly in multilingual contexts. While recent studies suggest that LLMs can transfer skills learned in one language to others, the internal mechanisms behind this ability remain unclear. We observed that the neuron activation patterns of LLMs exhibit similarities when processing the same language, revealing the existence and location of key linguistic regions. Additionally, we found that neuron activation patterns are similar when processing sentences with the same semantic meaning in different languages. This indicates that LLMs map semantically identical inputs from different languages into a "Lingua Franca", a common semantic latent space that allows for consistent processing across languages. This semantic alignment becomes more pronounced with training and increased model size, resulting in a more…
Taxonomy
Topics: Natural Language Processing Techniques · Second Language Learning and Teaching · Multilingual Education and Policy
Methods: BLOOM
