The Cognate Data Bottleneck in Language Phylogenetics

Luise Häuser; Alexandros Stamatakis

arXiv:2507.00911·cs.CL·July 2, 2025

Open Access

TL;DR

This paper investigates the challenge of limited cognate data for phylogenetic analysis in linguistics, demonstrating that current automatic data extraction methods from resources like BabelNet are insufficient for reliable phylogenetic inference.

Contribution

It provides an empirical evaluation of automatically extracted cognate datasets from BabelNet, highlighting their inadequacy for phylogenetic methods and discussing the limitations of current data sources.

Findings

01

Automatically extracted datasets are inconsistent with gold standard trees.

02

Current multilingual resources do not yield suitable character matrices.

03

Larger datasets are necessary but difficult to obtain for phylogenetic analysis.

Abstract

To fully exploit the potential of computational phylogenetic methods for cognate data, one needs to leverage specific (complex) models and machine learning-based techniques. However, both approaches require datasets that are substantially larger than the manually collected cognate data currently available. To the best of our knowledge, there exists no feasible approach to automatically generate larger cognate datasets. We substantiate this claim by automatically extracting datasets from BabelNet, a large multilingual encyclopedic dictionary. We demonstrate that phylogenetic inferences on the respective character matrices yield trees that are largely inconsistent with the established gold standard ground truth trees. We also discuss why we consider it unlikely that more suitable character matrices can be extracted from other multilingual resources. Phylogenetic data analysis…
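For readers unfamiliar with the term, a character matrix in this setting is commonly a binary presence/absence encoding of cognate classes per language. The Python sketch below illustrates that general encoding only; it is not the authors' extraction pipeline. The function name `binary_character_matrix`, the toy languages, and the cognate class labels (`A`, `B`, `C`, `D`) are all invented for illustration, as is the choice of `-` to mark a concept with no data.

```python
def binary_character_matrix(cognates):
    """Encode cognate classes as binary characters.

    cognates: dict mapping language -> dict mapping concept -> set of
    cognate class labels attested for that concept (hypothetical format).
    Returns (columns, matrix), where each column is a (concept, class) pair
    and matrix maps language -> string over '0', '1', '-'.
    """
    # Every (concept, cognate class) pair seen in any language is one column.
    columns = sorted({
        (concept, cls)
        for by_concept in cognates.values()
        for concept, classes in by_concept.items()
        for cls in classes
    })
    matrix = {}
    for language, by_concept in cognates.items():
        row = []
        for concept, cls in columns:
            if concept not in by_concept:
                row.append("-")  # no word recorded for this concept
            elif cls in by_concept[concept]:
                row.append("1")  # language has a word in this cognate class
            else:
                row.append("0")  # concept covered, but by another class
        matrix[language] = "".join(row)
    return columns, matrix


# Toy input with invented class labels (assumption, not real data).
toy = {
    "German": {"hand": {"A"}, "dog": {"C"}},
    "English": {"hand": {"A"}, "dog": {"D"}},
    "French": {"hand": {"B"}},  # "dog" deliberately missing
}
columns, matrix = binary_character_matrix(toy)
for language, row in matrix.items():
    print(language, row)
```

Each row can then be passed to a phylogenetic inference tool as one taxon's sequence. A sparse source produces many `-` entries, which is one way an automatically extracted matrix can end up carrying little usable signal.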

Taxonomy

Topics: Language and cultural evolution · Authorship Attribution and Profiling · Natural Language Processing Techniques