Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets
Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov

TL;DR
This study introduces a large-scale, subword-based framework for cross-linguistic comparison of 242 languages using Wikipedia lexicons, revealing significant correlations between subword similarity and linguistic relatedness.
Contribution
It presents a novel framework utilizing Byte-Pair Encoding and rank-based subword vectors for scalable, quantitative comparison of lexical similarities across many languages.
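One way to picture a rank-based subword vector is sketched below. The summary does not give the paper's exact weighting or distance, so this is an illustration under assumptions: each language's BPE vocabulary is taken as a list of unique subwords ordered by merge rank, each subword is weighted by its reciprocal rank, and two languages are compared by cosine distance over the union of their vocabularies. The helper names and the toy vocabularies are hypothetical.

```python
import math

def rank_vector(vocab, universe):
    """Map an ordered subword list to reciprocal-rank weights over a shared universe.

    Assumption: earlier (more frequent) merges get higher weight, absent subwords get 0.
    """
    rank = {tok: i for i, tok in enumerate(vocab)}
    return [1.0 / (rank[tok] + 1) if tok in rank else 0.0 for tok in universe]

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def vocab_distance(vocab_a, vocab_b):
    """Distance between two rank-ordered subword vocabularies."""
    universe = sorted(set(vocab_a) | set(vocab_b))
    return cosine_distance(rank_vector(vocab_a, universe),
                           rank_vector(vocab_b, universe))

# Hypothetical toy vocabularies, ordered by merge rank (not real data).
toy_a = ["de", "la", "en", "ar", "os"]
toy_b = ["de", "el", "en", "os", "ar"]
toy_c = ["sch", "ung", "ei", "ch", "er"]

print(vocab_distance(toy_a, toy_b))  # shared subwords: smaller distance
print(vocab_distance(toy_a, toy_c))  # no shared subwords: distance 1.0
```

Reciprocal-rank weighting is only one plausible choice; raw ranks, log ranks, or a Jaccard overlap of the top-k subwords would fit the same description.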
Findings
BPE segmentation aligns with morpheme boundaries 95% better than a random baseline across 15 languages (F1 = 0.34 vs 0.15).
BPE vocabulary similarity correlates with genetic language relatedness (Mantel r=0.329).
Nearly half of cross-linguistic homographs have different segmentations, varying with phylogenetic distance.
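The Mantel statistic reported above measures how strongly two distance matrices over the same set of languages agree. A minimal permutation-test sketch follows, assuming Pearson correlation over the upper triangles and a one-sided p-value; the summary does not say which variant or how many permutations the authors used, and the toy matrix is hypothetical.

```python
import random

def upper_triangle(matrix, order=None):
    """Flatten the strict upper triangle, optionally after relabelling rows/columns."""
    n = len(matrix)
    idx = order if order is not None else list(range(n))
    return [matrix[idx[i]][idx[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value) for two symmetric matrices."""
    rng = random.Random(seed)
    flat_a = upper_triangle(dist_a)
    r_obs = pearson(flat_a, upper_triangle(dist_b))
    order = list(range(len(dist_b)))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)  # relabel the languages of the second matrix
        if pearson(flat_a, upper_triangle(dist_b, order)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Hypothetical 4-language distance matrix (symmetric, zero diagonal).
toy = [
    [0.0, 0.2, 0.8, 0.9],
    [0.2, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
r, p = mantel(toy, toy)
print(r, p)
```

Rows and columns are permuted together because the entries of a distance matrix are not independent, which is why an ordinary correlation test on the flattened values would be inappropriate here.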
Abstract
We present a large-scale comparative study of 242 Latin- and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than a random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different…
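The morpheme-boundary F1 mentioned in the abstract can be illustrated with a per-word boundary comparison. This sketch assumes boundaries are scored as internal cut positions within a word and that a predicted segmentation is compared against a single gold segmentation; how the paper aggregates scores across words and languages is not stated here, and the example word is hypothetical.

```python
def boundaries(segments):
    """Return the set of internal cut positions (character offsets) of a segmentation."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts

def boundary_f1(predicted, gold):
    """F1 between predicted and gold boundary positions for one word."""
    pred_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    if not pred_cuts and not gold_cuts:
        return 1.0  # both leave the word unsegmented
    true_pos = len(pred_cuts & gold_cuts)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_cuts)
    recall = true_pos / len(gold_cuts)
    return 2 * precision * recall / (precision + recall)

gold = ["un", "break", "able"]
print(boundary_f1(["un", "break", "able"], gold))  # exact match
print(boundary_f1(["un", "breakable"], gold))      # one of two boundaries found
print(boundary_f1(["unb", "reakable"], gold))      # no boundary matches
```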
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Authorship Attribution and Profiling
