Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation

Frederic Blum; Johann-Mattis List

arXiv:2602.02221 · cs.CL · February 3, 2026

TL;DR

This paper introduces a new computational measure of regularity in cognate sets, together with a method that uses the measure to identify irregular words. Experiments on simulated and real data validate the method, which can help improve data quality in historical linguistics.

Contribution

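The paper presents a new regularity measure, the balanced average recurrence of correspondence patterns, and a method that identifies irregular cognates through leave-one-out validation. Together they replace intuitive judgments of regularity in language comparison with a quantitative evaluation.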

Findings

01. Achieved 85% accuracy in identifying irregular words

02. Validated the method with simulated and real data

03. Demonstrated benefits of dataset subsampling

Abstract

Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ…
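The leave-one-out idea named in the title can be illustrated with a small sketch. This is not the authors' method and does not compute the balanced average recurrence, whose definition is not given in the abstract above. It is a toy stand-in that assumes pre-aligned cognate sets, hides one word at a time, predicts each of its segments by majority vote over alignment sites in the other cognate sets that agree with the remaining languages, and flags the word when the prediction fails. The data, the languages A, B and C, and the names `columns`, `loo_score` and `flag_irregular` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: aligned cognate sets in three made-up languages.
# Each word is a list of segments; words in one set have equal length.
COGNATES = {
    "hand":  {"A": ["p", "a", "t"], "B": ["f", "a", "t"], "C": ["p", "o", "t"]},
    "foot":  {"A": ["p", "a", "k"], "B": ["f", "a", "k"], "C": ["p", "o", "k"]},
    "stone": {"A": ["t", "a", "p"], "B": ["t", "a", "f"], "C": ["t", "o", "p"]},
    # The final "p" in language B departs from the recurring p : f : p pattern.
    "water": {"A": ["k", "a", "p"], "B": ["k", "a", "p"], "C": ["k", "o", "p"]},
}


def columns(words):
    """Turn {language: segments} into a list of {language: sound} alignment sites."""
    langs = sorted(words)
    return [dict(zip(langs, site)) for site in zip(*(words[l] for l in langs))]


def loo_score(data, concept, lang):
    """Share of predictable segments of the held-out word that are predicted correctly.

    Returns None when no segment can be predicted from the other cognate sets.
    """
    reference = [col for c, words in data.items() if c != concept
                 for col in columns(words)]
    hits = total = 0
    for col in columns(data[concept]):
        # Sounds of the remaining languages at this alignment site.
        context = {l: s for l, s in col.items() if l != lang}
        votes = Counter(
            ref[lang] for ref in reference
            if lang in ref and all(ref.get(l) == s for l, s in context.items())
        )
        if not votes:
            continue  # no comparable site elsewhere: skip rather than penalize
        total += 1
        hits += votes.most_common(1)[0][0] == col[lang]
    return hits / total if total else None


def flag_irregular(data, threshold=1.0):
    """Return (concept, language) pairs whose leave-one-out score is below threshold."""
    flagged = []
    for concept, words in data.items():
        for lang in words:
            score = loo_score(data, concept, lang)
            if score is not None and score < threshold:
                flagged.append((concept, lang))
    return flagged


FLAGGED = flag_irregular(COGNATES)
```

On this toy data only the language B word for "water" is flagged. Sites with no comparable evidence are skipped instead of counted as errors, so that one irregular word does not drag down the scores of the other words in its cognate set; a real workflow would need a more principled treatment of missing evidence.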

Taxonomy

Topics: Language and cultural evolution · Authorship Attribution and Profiling · Natural Language Processing Techniques