Beyond cognacy

Gerhard Jäger

arXiv:2507.03005 · cs.CL · March 30, 2026

TL;DR

This paper compares traditional phylogenetic methods based on expert-annotated cognates with two automated approaches, finding that MSA-based inference yields more accurate language phylogenies and scales better.

Contribution

It introduces and evaluates two automated methods for language phylogeny inference and shows that the MSA-based approach outperforms existing methods.

Findings

01

MSA-based inference produces trees more consistent with linguistic classifications.

02

The MSA-based approach better predicts typological variation.

03

Phylogenetic signal is clearer with MSA-based methods.

Abstract

Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent…
