Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster
Minu Kim, Hoirin Kim, David R. Mortensen

TL;DR
Scaling the language coverage of a self-supervised speech model from 126 to over 4,000 languages lets it uncover deep linguistic relationships, including genealogical lineages and contact phenomena. The effect is non-linear and appears only at the largest scale.
Contribution
This work demonstrates that massively multilingual Self-Supervised Speech Models (S3Ms) can internalize complex signals of language history, with their representations shifting from surface similarities to deep genealogical and contact-based relationships.
Findings
Phylogenetic recovery stagnates up to 1K languages
A dramatic shift occurs at 4K scale, revealing deep language relationships
A robust Pacific macro-cluster emerges, capturing language contact and history
Abstract
Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling linguistic coverage of an S3M-based language identification system from 126 to 4,017 languages influences this topology. Our results reveal a non-linear effect: while phylogenetic recovery remains stagnant up to the 1K scale, the 4K model displays a dramatic qualitative shift, resolving both clear lineages and complex, long-term linguistic contact. Notably, our analysis reveals the emergence of a robust macro-cluster in the Pacific (comprising Papuan, Oceanic, and Australian languages) and investigates its latent drivers. We find that the 4K model utilizes a more concentrated encoding…
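As a rough illustration of what comparing language representations against genealogy can look like, the sketch below computes pairwise cosine distances between per-language embedding vectors and reports how often a language's nearest neighbor belongs to the same family. This is not the paper's evaluation procedure: the function names, the nearest-neighbor purity score, and the toy embeddings and family labels are all assumptions made for illustration only.

```python
import numpy as np


def cosine_distance_matrix(emb):
    """Pairwise cosine distances between the rows of `emb`."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def nearest_neighbor_family_purity(emb, families):
    """Fraction of languages whose nearest neighbor shares their family label.

    Hypothetical stand-in for a phylogenetic-recovery score; the paper's
    actual metric is not given in the text above.
    """
    dist = cosine_distance_matrix(emb)
    np.fill_diagonal(dist, np.inf)  # exclude each language matching itself
    nearest = dist.argmin(axis=1)
    families = np.asarray(families)
    return float(np.mean(families == families[nearest]))


# Toy data (invented): four "languages" in two made-up families.
toy_embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
toy_families = ["A", "A", "B", "B"]

purity = nearest_neighbor_family_purity(toy_embeddings, toy_families)
print(f"nearest-neighbor family purity: {purity:.2f}")
```

Under this toy setup, a score near 1 would mean that embedding neighborhoods line up with the family labels, while a score near chance would suggest the geometry reflects something else, such as geography or surface typology.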
Taxonomy
Topics: Language and cultural evolution · Forensic and Genetic Research · Animal Vocal Communication and Behavior
