Sampling the Swadesh List to Identify Similar Languages with Tree Spaces
Garett Ordway, Vic Patrangenaru

TL;DR
This paper explores a novel method for analyzing language relationships using simplified tree spaces and clustering techniques based on Swadesh list data, aiming to identify language ancestry and similarities.
Contribution
It introduces a new approach combining open book data analysis, 3-spider tree spaces, and single linkage clustering to study language relationships from Swadesh lists.
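As an illustration of the single linkage step named above, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `single_linkage`, the item labels A to D, and the toy distances are all hypothetical, and how the paper derives distances from Swadesh lists is not stated in this summary.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative single linkage clustering (illustrative sketch).

    labels: list of item names.
    dist:   dict mapping frozenset({a, b}) to a pairwise distance.
    Returns the merges in order as (cluster_a, cluster_b, height).
    """
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def gap(pair):
        # Single linkage: the distance between two clusters is the
        # smallest distance between any member of one and any of the other.
        a, b = pair
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((set(a), set(b), gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical toy distances between four placeholder items (not real data).
toy = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.5,
    frozenset(("A", "D")): 0.9,
    frozenset(("B", "C")): 0.4,
    frozenset(("B", "D")): 0.8,
    frozenset(("C", "D")): 0.7,
}
merges = single_linkage(["A", "B", "C", "D"], toy)
heights = [h for _, _, h in merges]
```

The sequence of merges and their heights defines a tree (dendrogram) over the items; with the toy input, A and B join first, then C, then D.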
Findings
Identified non-sticky and sticky sample means indicating different ancestral relationships.
Demonstrated the use of 3-spider tree spaces for language clustering.
Provided initial results on language ancestry inference.
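To make the sticky versus non-sticky distinction concrete, the sketch below computes a Fréchet sample mean on a 3-spider (three rays glued at a common origin), using the standard closed form for spiders. It assumes the paper's definitions match the usual ones; the function name `spider_mean` and the `(leg, r)` encoding of a point (leg index and distance from the origin) are hypothetical choices made for this example.

```python
def spider_mean(points, legs=3):
    """Frechet sample mean on a spider with `legs` rays glued at the origin.

    points: list of (leg, r), with leg in range(legs) and r >= 0 the
            distance from the origin along that leg.
    Returns (leg, distance) if the mean lies strictly inside a leg,
    or (None, 0.0) if the mean is at the origin.
    """
    n = len(points)
    total = sum(r for _, r in points)
    for j in range(legs):
        on_leg = sum(r for leg, r in points if leg == j)
        # Seen from leg j, points on other legs lie at distance -r,
        # so the candidate mean on leg j is the average of signed distances.
        m_j = (on_leg - (total - on_leg)) / n
        if m_j > 0:
            # At most one leg can have a positive folded average.
            return (j, m_j)
    # No leg pulls the mean away from the junction.
    return (None, 0.0)


off_origin = spider_mean([(0, 3.0), (0, 2.0), (1, 1.0)])
at_origin = spider_mean([(0, 1.0), (1, 1.0), (2, 1.0)])
```

In the first call the data concentrate on leg 0, so the mean lies on that leg away from the origin (the non-sticky case). In the second the sample is balanced across the three legs and the mean sits at the junction, the situation usually described as sticky.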
Abstract
Communication plays a vital role in human interaction. Studying language is a worthwhile task, and it has more recently become quantitative in nature with the development of fields like quantitative comparative linguistics and lexicostatistics. With respect to the authors' own native languages, the ancestry of the English language and the Latin alphabet are of primary interest. The Indo-European Tree traces many modern languages back to the Proto-Indo-European root. Swadesh's cognates played a large role in developing that historical perspective, where some of the primary branches are Germanic, Celtic, Italic, and Balto-Slavic. This paper will use data analysis on open books, where the simplest singular space is the 3-spider (a union T3 of three rays with their endpoints glued at a point 0), which can represent these tree spaces for language clustering. These trees are built using a single…
Taxonomy
Topics: Authorship Attribution and Profiling · semigroups and automata theory
