Neighbors and relatives: How do speech embeddings reflect linguistic connections across the world?
Tuukka Törö, Antti Suni, Juraj Šimko

TL;DR
This paper explores how speech embeddings derived from machine learning models can reveal linguistic relationships across the world, aligning well with traditional measures and offering scalable, data-driven insights into language variation.
Contribution
It demonstrates the effectiveness of speech embeddings from a self-supervised model in capturing linguistic relationships, enabling large-scale analysis of language connections beyond traditional methods.
Findings
Embedding-based distances align with genealogical, lexical, and geographical measures.
Speech embeddings effectively capture global and local linguistic patterns.
Method shows promise for analyzing low-resource languages and linguistic diversity.
Abstract
Investigating linguistic relationships on a global scale requires analyzing diverse features such as syntax, phonology and prosody, which evolve at varying rates influenced by internal diversification, language contact, and sociolinguistic factors. Recent advances in machine learning (ML) offer complementary alternatives to traditional historical and typological approaches. Instead of relying on expert labor in analyzing specific linguistic features, these new methods enable the exploration of linguistic variation through embeddings derived directly from speech, opening new avenues for large-scale, data-driven analyses. This study employs embeddings from the fine-tuned XLS-R self-supervised language identification model voxlingua107-xls-r-300m-wav2vec to analyze relationships between 106 world languages based on speech recordings. Using linear discriminant analysis (LDA), language…
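As a rough illustration of what an "embedding-based distance" between languages can look like, the sketch below averages per-utterance embedding vectors into one centroid per language and compares centroids with cosine distance. This is an assumed, simplified stand-in and not the authors' pipeline: the abstract mentions LDA, and its remaining steps are cut off in this excerpt. The helper names, the language labels "A", "B", "C", and the 3-dimensional toy vectors are all invented for the example.

```python
import math
from collections import defaultdict


def language_centroids(embeddings, labels):
    """Average the embedding vectors belonging to each language label.

    Hypothetical helper: real embeddings would come from a speech model,
    here they are plain lists of floats.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, lang in zip(embeddings, labels):
        if lang not in sums:
            sums[lang] = [0.0] * len(vec)
        sums[lang] = [s + v for s, v in zip(sums[lang], vec)]
        counts[lang] += 1
    return {lang: [s / counts[lang] for s in total] for lang, total in sums.items()}


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(centroids):
    """Pairwise cosine distances between centroids, in sorted label order."""
    langs = sorted(centroids)
    matrix = [[cosine_distance(centroids[a], centroids[b]) for b in langs] for a in langs]
    return langs, matrix


# Toy, made-up vectors (assumption): two utterances for each of three languages.
toy_embeddings = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0],  # language "A"
    [0.8, 0.3, 0.1], [0.9, 0.3, 0.0],  # language "B"
    [0.0, 0.2, 1.0], [0.1, 0.1, 0.9],  # language "C"
]
toy_labels = ["A", "A", "B", "B", "C", "C"]

langs, dist = distance_matrix(language_centroids(toy_embeddings, toy_labels))
```

A matrix like `dist` is the kind of object that could then be compared against genealogical, lexical, or geographical distance matrices, as the findings above describe; how the paper performs that comparison is not stated in this excerpt.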
Taxonomy
Topics: Language and cultural evolution · Authorship Attribution and Profiling · Computational and Text Analysis Methods
Methods: ALIGN
