Indo-European languages tree by Levenshtein distance
Maurizio Serva, Filippo Petroni

TL;DR
This paper introduces a new method for constructing language trees using a normalized Levenshtein distance to measure lexical differences, reducing subjectivity and improving reproducibility in linguistic phylogenetics.
Contribution
It proposes a novel genetic distance measure based on Levenshtein distance for language comparison, enhancing objectivity over traditional cognate-based methods.
Findings
The resulting language tree closely matches established Indo-European phylogenies.
The method reduces subjectivity and increases reproducibility in language classification.
Significant differences from previous trees highlight the impact of the new distance measure.
Abstract
The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited \cite{GA,GJ} to construct language trees. The key point is the definition of a distance between all pairs of languages that is the analogue of a genetic distance. Many methods have been proposed to define these distances; one of these, used by glottochronology, computes the distance from the percentage of shared ``cognates''. Cognates are words inferred to have a common historical origin, and subjective judgment plays a relevant role in the identification process. Here we push the analogy with evolutionary biology closer and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all the words contained in a Swadesh list \cite{Sw}. The subjectivity of the process is consistently…
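The distance described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the renormalization divides the Levenshtein distance by the length of the longer word, and the function names (`levenshtein`, `normalized_distance`, `language_distance`) and the three-word lists are hypothetical stand-ins for a full Swadesh list.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word
    (assumed normalization), giving a value between 0 and 1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(words_a: Dict[str, str], words_b: Dict[str, str]) -> float:
    """Average normalized distance over the meanings present in both lists."""
    shared = words_a.keys() & words_b.keys()
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(words_a[m], words_b[m]) for m in shared)
    return total / len(shared)


# Toy word lists keyed by meaning (illustrative only, not the paper's data).
english = {"night": "night", "water": "water", "mother": "mother"}
german = {"night": "nacht", "water": "wasser", "mother": "mutter"}

print(language_distance(english, german))
```

Computing `language_distance` for every pair of languages yields a distance matrix, which a standard tree-building method can then turn into a language tree. Because every step is mechanical, two researchers starting from the same word lists obtain the same distances, which is the reproducibility advantage the abstract claims over cognate judgments.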
