How Language Directions Align with Token Geometry in Multilingual LLMs

JaeSeong Kim; Suan Lee

arXiv:2511.16693·cs.CL·November 24, 2025

Open Access

TL;DR

This paper systematically analyzes how multilingual large language models encode language information across their layers. It finds that languages become separable in the first transformer block and that the alignment between language directions and vocabulary embeddings tracks the language composition of the training data, with implications for fairness and data strategies.

Contribution

It introduces a probing framework (linear and nonlinear probes over all 268 transformer layers of six multilingual LLMs) and a new Token–Language Alignment analysis to measure how language is encoded in multilingual LLMs, highlighting the impact of training data on language representation.

Findings

01

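Language information becomes sharply separated in the first transformer block (+76.4 ± 8.2 percentage points from Layer 0 to Layer 1) and stays almost fully linearly separable through the rest of the model.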

02

The alignment between language directions and vocabulary embeddings is strongly tied to the language composition of the training data.

03

Chinese-inclusive models show higher language alignment than English-centric models.

Abstract

Multilingual LLMs demonstrate strong performance across diverse languages, yet there has been limited systematic analysis of how language information is structured within their internal representation space and how it emerges across layers. We conduct a comprehensive probing study on six multilingual LLMs, covering all 268 transformer layers, using linear and nonlinear probes together with a new Token–Language Alignment analysis to quantify the layer-wise dynamics and geometric structure of language encoding. Our results show that language information becomes sharply separated in the first transformer block (+76.4 ± 8.2 percentage points from Layer 0 to 1) and remains almost fully linearly separable throughout model depth. We further find that the alignment between language directions and vocabulary embeddings is strongly tied to the language composition of the training data.…
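
The layer-wise linear probing described in the abstract can be illustrated with a small sketch. This is not the authors' code: the features are synthetic stand-ins for hidden states, the probe is a ridge-regularized least-squares classifier chosen for simplicity, and every name here (`make_layer_features`, `fit_linear_probe`, `layerwise_probe`, the `separation` knob) is a hypothetical introduced for illustration. It only shows the general idea of fitting one linear probe per layer and comparing held-out language-identification accuracy.

```python
import numpy as np


def make_layer_features(n_per_lang, n_langs, dim, separation, rng):
    """Synthetic stand-in for hidden states at one layer (assumption, not real model data).

    Each language gets a random center scaled by `separation`; samples are
    the center plus unit Gaussian noise.
    """
    centers = rng.normal(size=(n_langs, dim)) * separation
    X = np.concatenate(
        [centers[k] + rng.normal(size=(n_per_lang, dim)) for k in range(n_langs)]
    )
    y = np.repeat(np.arange(n_langs), n_per_lang)
    return X, y


def fit_linear_probe(X, y, n_langs, ridge=1e-2):
    """Fit a linear probe by ridge regression onto one-hot language labels."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    Y = np.eye(n_langs)[y]                     # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def probe_accuracy(W, X, y):
    """Fraction of samples whose highest-scoring class matches the label."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.argmax(Xb @ W, axis=1) == y))


def layerwise_probe(separations, n_langs=4, dim=32, n_per_lang=200, seed=0):
    """Train and evaluate one probe per simulated layer; return held-out accuracies."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for sep in separations:
        X, y = make_layer_features(n_per_lang, n_langs, dim, sep, rng)
        idx = rng.permutation(len(X))
        split = len(X) // 2
        train, test = idx[:split], idx[split:]
        W = fit_linear_probe(X[train], y[train], n_langs)
        accuracies.append(probe_accuracy(W, X[test], y[test]))
    return accuracies


# Simulated "Layer 0" has no language signal; later layers do.
accs = layerwise_probe([0.0, 2.0, 2.0])
for layer, acc in enumerate(accs):
    print(f"layer {layer}: probe accuracy = {acc:.3f}")
```

With real models, the synthetic features would be replaced by per-layer hidden states extracted from text in each language; the jump in probe accuracy between the first two entries is the kind of quantity the reported Layer 0 to Layer 1 gain refers to.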

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks