Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation
Frederic Blum, Johann-Mattis List

TL;DR
This paper introduces a new computational measure of regularity in cognate sets and a method for identifying irregular words. Experiments on simulated and real data validate the method, which can improve data quality in historical linguistics.
Contribution
It presents a novel regularity measure and an irregular cognate identification method using leave-one-out validation, improving the evaluation of word correspondences in language comparison.
Findings
Achieved 85% accuracy in identifying irregular words
Validated method with simulated and real data
Demonstrated benefits of dataset subsampling
Abstract
Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ…
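The leave-one-out idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the score below is a toy stand-in (the share of sound positions whose pattern recurs), not the paper's balanced average recurrence, and the pattern IDs, function names, and threshold are all hypothetical. Each word is removed from its cognate set in turn, and it is flagged if the set's score rises noticeably without it.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy representation: a cognate set maps a language name to the
# list of correspondence-pattern IDs assigned to the sounds of its word.
CognateSet = Dict[str, List[str]]


def pattern_counts(cognate_sets: Dict[str, CognateSet]) -> Counter:
    """Count how often each pattern ID occurs across all words in the data."""
    counts: Counter = Counter()
    for words in cognate_sets.values():
        for patterns in words.values():
            counts.update(patterns)
    return counts


def toy_regularity(words: CognateSet, counts: Counter, min_recurrence: int = 2) -> float:
    """Toy stand-in score (assumption, not the paper's measure): the share of
    sound positions whose pattern occurs at least `min_recurrence` times."""
    positions = [p for patterns in words.values() for p in patterns]
    if not positions:
        return 0.0
    regular = sum(1 for p in positions if counts[p] >= min_recurrence)
    return regular / len(positions)


def leave_one_out(words: CognateSet, counts: Counter,
                  threshold: float = 0.1) -> List[Tuple[str, float]]:
    """Flag words whose removal raises the set's score by more than `threshold`."""
    baseline = toy_regularity(words, counts)
    flagged = []
    for language in words:
        # Rebuild the cognate set without the word from this language.
        reduced = {l: p for l, p in words.items() if l != language}
        gain = toy_regularity(reduced, counts) - baseline
        if gain > threshold:
            flagged.append((language, gain))
    return flagged


# Invented example data: language C has one sound in a pattern ("p9") that
# occurs nowhere else, so its word in the "hand" set looks irregular.
cognate_sets = {
    "hand": {"A": ["p1", "p2"], "B": ["p1", "p2"], "C": ["p1", "p9"]},
    "foot": {"A": ["p1", "p2"], "B": ["p1", "p2"], "C": ["p1", "p2"]},
}
counts = pattern_counts(cognate_sets)
flagged_hand = leave_one_out(cognate_sets["hand"], counts)
flagged_foot = leave_one_out(cognate_sets["foot"], counts)
```

In this toy data only the word from language C in the "hand" set is flagged, because dropping it removes the single non-recurring pattern. A real workflow would infer the correspondence patterns from aligned cognate sets instead of assuming fixed pattern IDs.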
Taxonomy
Topics: Language and cultural evolution · Authorship Attribution and Profiling · Natural Language Processing Techniques
