Vocabulary shapes cross-lingual variation of word-order learnability in language models

Jonas Mayer Martins; Jaap Jumelet; Viola Priesemann; Lisa Beinborn

arXiv:2603.19427 · cs.CL · March 23, 2026

Open Access

TL;DR

This study investigates how vocabulary structure influences the learnability of word order in language models across different languages, revealing vocabulary as a key factor in cross-lingual variation.

Contribution

The study shows that vocabulary structure, rather than a coarse distinction between free- and fixed-word-order languages, predicts word-order learnability in transformer language models across languages.

Findings

01. Vocabulary structure predicts surprisal across languages.
02. Irregular word order increases model surprisal.
03. Vocabulary features outweigh language typology in learnability.

Abstract

Why do some languages like Czech permit free word order, while others like English do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction of free- (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts the model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
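The abstract mentions synthetic word-order variants and model surprisal without giving the procedure. The following is a minimal illustrative sketch, not the authors' method: the helper names (`reverse_sentence`, `shuffle_with_irregularity`, `mean_surprisal`) and the `irregularity` parameter are hypothetical, and the per-sentence shuffling scheme is an assumption chosen only to show how a regular transformation (reversal) differs from an irregular one (random reordering), and how surprisal is computed from token probabilities.

```python
import math
import random


def reverse_sentence(tokens):
    """Return the tokens in reversed order (a fully regular transformation)."""
    return tokens[::-1]


def shuffle_with_irregularity(tokens, irregularity, rng):
    """Hypothetical perturbation: with probability `irregularity`,
    return a random permutation of the sentence; otherwise a plain copy."""
    if rng.random() < irregularity:
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        return shuffled
    return tokens[:]


def mean_surprisal(token_probs):
    """Mean surprisal in bits: the average of -log2(p) over the tokens."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Usage: build variants of one sentence and score made-up probabilities.
rng = random.Random(0)
sentence = "the dog chased the cat".split()
print(reverse_sentence(sentence))
print(shuffle_with_irregularity(sentence, 0.5, rng))
print(mean_surprisal([0.5, 0.25]))  # 1.5
```

In an actual experiment the probabilities would come from a language model pretrained on each variant corpus; here they are placeholder values. Higher mean surprisal on held-out text corresponds to what the abstract calls reduced learnability.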

Taxonomy

Topics: Language Development and Disorders · Language and cultural evolution · Neurobiology of Language and Bilingualism