Global-scale phylogenetic linguistic inference from lexical resources
Gerhard Jäger

TL;DR
This paper introduces machine learning methods to automate phylogenetic linguistic inference from large lexical datasets, expanding scope beyond expert judgments and enabling analysis of extensive language diversity.
Contribution
It develops new techniques for automatic cognate detection and character creation, facilitating large-scale phylogenetic analysis without expert input.
Findings
Effective dissimilarity matrix for phylogenetic inference
Successful supervised cognate clustering with SVM
Binary characters suitable for phylogenetic analysis
Abstract
Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant worldwide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training…
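To make the first step of the abstract concrete, here is a minimal sketch of how Pointwise Mutual Information between sound classes could be estimated once aligned sound pairs are available. It is an illustration under assumptions, not the paper's procedure: the function name `pmi_scores` and the toy data are hypothetical, the alignment and optimization steps mentioned in the abstract are not shown, and no smoothing is applied.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI(a, b) = log(p(a, b) / (q(a) * q(b))) from aligned sound pairs.

    Hypothetical helper for illustration only. Pairs are treated as
    symmetric, so each observed pair is counted in both directions.
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

    total = sum(pair_counts.values())

    # Marginal probability of each sound: sum of the joint over its partners.
    marginal = Counter()
    for (a, _b), count in pair_counts.items():
        marginal[a] += count / total

    scores = {}
    for (a, b), count in pair_counts.items():
        joint = count / total
        scores[(a, b)] = math.log(joint / (marginal[a] * marginal[b]))
    return scores


# Toy input (assumed): pairs of sounds taken from already-aligned word pairs.
toy_pairs = [("p", "p"), ("p", "b"), ("t", "t"), ("t", "d"), ("p", "p"), ("t", "t")]
toy_scores = pmi_scores(toy_pairs)
```

Pairs that are never observed together, such as `("p", "t")` in the toy data, receive no entry in the result; a practical implementation would need smoothing to assign them a finite score.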
Taxonomy
Methods: Support Vector Machine
