Phonological distances for linguistic typology and the origin of Indo-European languages
Marius Mavridis, Juan De Gregorio, Raul Toral, David Sanchez

TL;DR
This paper shows that an information-theoretic analysis of phoneme sequences can quantify linguistic relatedness, recover major language families, and constrain a plausible Indo-European homeland consistent with the Steppe hypothesis.
Contribution
It introduces a phonological distance between languages, built from short-range phoneme dependencies and the articulatory features of phonemes, and applies it to 67 modern languages drawn from a multilingual parallel corpus.
Findings
Phoneme dependencies encode large-scale linguistic relatedness.
The phonological distance matrix recovers major language families.
Phonological distance correlates with geographic distance, which constrains a plausible Indo-European homeland consistent with the Steppe hypothesis.
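To make the last finding concrete, here is a minimal sketch of how a correlation between geographic and phonological distance could be computed over language pairs. It is not the authors' pipeline: the function names (`haversine_km`, `pearson`, `geo_phon_correlation`), the use of great-circle distance, and the toy coordinates and distances are all assumptions made for illustration.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def geo_phon_correlation(coords, phon_dist):
    """Correlate geographic and phonological distance over all language pairs.

    coords: {language: (lat, lon)}; phon_dist: {frozenset({a, b}): distance}.
    Both containers are hypothetical inputs, not the paper's data format.
    """
    pairs = list(combinations(sorted(coords), 2))
    geo = [haversine_km(coords[a], coords[b]) for a, b in pairs]
    phon = [phon_dist[frozenset((a, b))] for a, b in pairs]
    return pearson(geo, phon)


# Toy input: three made-up "languages" on the equator, with phonological
# distances chosen to be proportional to their separation.
toy_coords = {"A": (0.0, 0.0), "B": (0.0, 10.0), "C": (0.0, 20.0)}
toy_phon = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 1.0,
    frozenset(("A", "C")): 2.0,
}
r = geo_phon_correlation(toy_coords, toy_phon)
```

Because the entries of a distance matrix are not independent, the significance of such a coefficient is usually assessed with a permutation scheme such as the Mantel test, not with the standard p-value for Pearson's r.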
Abstract
We show that short-range phoneme dependencies encode large-scale patterns of linguistic relatedness, with direct implications for quantitative typology and evolutionary linguistics. Specifically, using an information-theoretic framework, we argue that phoneme sequences modeled as second-order Markov chains essentially capture the statistical correlations of a phonological system. This finding enables us to quantify distances among 67 modern languages from a multilingual parallel corpus employing a distance metric that incorporates articulatory features of phonemes. The resulting phonological distance matrix recovers major language families and reveals signatures of contact-induced convergence. Remarkably, we obtain a clear correlation with geographic distance, allowing us to constrain a plausible homeland region for the Indo-European family, consistent with the Steppe hypothesis.
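One way to see what "second-order Markov chains essentially capture the statistical correlations" could mean in practice is to estimate the conditional entropy of the next phoneme given the previous k phonemes and check how quickly it stops decreasing as k grows. The sketch below uses a naive plug-in (frequency-count) estimator, which is an assumption here and is known to be biased for short sequences. The function name `conditional_entropy` and the toy sequence are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def conditional_entropy(seq, order):
    """Plug-in estimate, in bits, of H(next symbol | previous `order` symbols)."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(order, len(seq)):
        context = tuple(seq[i - order:i])
        joint_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    total = sum(joint_counts.values())
    h = 0.0
    for (context, _symbol), n in joint_counts.items():
        # p(context, symbol) * log2 p(symbol | context)
        h -= (n / total) * log2(n / context_counts[context])
    return h


# Toy "phoneme" sequence: strictly alternating, so one symbol of context
# already determines the next symbol.
toy_sequence = list("abababababab")
entropy_by_order = {k: conditional_entropy(toy_sequence, k) for k in (0, 1, 2)}
```

On real phoneme transcriptions, the number of distinct contexts grows quickly with the order, so estimates at higher orders need far more data, or a bias-corrected estimator, to be reliable.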
