Multilingual Language Models Encode Script Over Linguistic Structure

Aastha A K Verma; Anwoy Chatterjee; Mehak Gupta; Tanmoy Chakraborty

arXiv:2604.05090·cs.CL·April 22, 2026

TL;DR

Multilingual language models organize their internal representations primarily around orthographic surface form (script) rather than abstract linguistic structure; typological abstraction emerges only gradually in deeper layers.

Contribution

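The paper analyzes language-associated units in multilingual LMs across model families and scales, using the LAPE metric and Sparse Autoencoders. It shows that these units are strongly conditioned on orthography, while typological structure becomes progressively more accessible in deeper layers.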

Findings

01

Orthography strongly conditions language-associated units.

02

Romanization induces near-disjoint representations that align with neither native-script inputs nor English.

03

Typological structure becomes increasingly accessible to probes in deeper layers.
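
One way to make "near-disjoint" concrete is to compare the sets of language-associated units selected under two input conditions. The sketch below uses Jaccard overlap for that comparison. This is an assumption for illustration: the page does not say which overlap measure the paper uses, and the function name and unit IDs are hypothetical.

```python
def jaccard_overlap(units_a, units_b):
    """Jaccard overlap between two collections of unit IDs (0 = disjoint, 1 = identical)."""
    a, b = set(units_a), set(units_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical (layer, index) unit IDs selected for one language
# under native-script input and under romanized input.
native_units = {(3, 17), (3, 42), (9, 5), (12, 88)}
romanized_units = {(3, 42), (7, 11), (10, 2), (12, 90)}

overlap = jaccard_overlap(native_units, romanized_units)
print(f"overlap = {overlap:.3f}")
```

A value near 0 would correspond to the near-disjoint case described in finding 02; a value near 1 would mean the same units are selected regardless of script.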

Abstract

Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. To do so, we analyze language-associated units across different model families and scales using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers,…
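
The abstract names LAPE without defining it on this page. A minimal sketch, assuming the usual formulation: for each unit, estimate how often it activates on text from each language, normalize those activation probabilities into a distribution over languages, and take the entropy of that distribution. Low entropy then marks a unit whose activity is concentrated in few languages. The function names, the zero activation threshold, and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def activation_probability(activations, threshold=0.0):
    """Fraction of tokens on which a unit's activation exceeds the threshold."""
    return sum(a > threshold for a in activations) / len(activations)


def lape(per_language_probs):
    """Entropy of a unit's activation probabilities, normalized across languages."""
    total = sum(per_language_probs)
    if total == 0:
        # Assumption: a unit that never fires is treated as maximally non-specific.
        return math.log(len(per_language_probs))
    dist = [p / total for p in per_language_probs]
    return -sum(q * math.log(q) for q in dist if q > 0)


# Toy activations of a single unit on tokens from three languages.
acts = {
    "hi": [0.9, 1.2, 0.0, 0.7],    # fires on 3 of 4 tokens
    "en": [0.0, -0.1, 0.0, 0.0],   # never fires
    "bn": [0.0, 0.0, -0.3, 0.0],   # never fires
}
probs = [activation_probability(a) for a in acts.values()]
score = lape(probs)  # all probability mass on one language, so the entropy is zero
```

In a real analysis the activation probabilities would be estimated over large per-language corpora, and units with the lowest scores would be kept as language-associated.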
