Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon
Mukhlis Amien, Go Frendi Gunawan

TL;DR
This study uses machine learning to identify non-mainstream vocabulary in Sulawesi languages. It characterizes the phonological patterns and geographic distribution of these forms and finds no evidence of a single substrate language.
Contribution
It combines rule-based cognate subtraction with a phonological classifier to detect non-conforming vocabulary, advancing computational methods in historical linguistics.
Findings
Identified 438 candidate substrate forms (26.5%) across six Sulawesi languages.
An XGBoost classifier trained on 26 phonological features achieved an AUC of 0.763 in distinguishing inherited from non-mainstream forms.
No evidence found for a single pre-Austronesian language layer.
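To help read the AUC figure: AUC can be interpreted as the probability that a randomly chosen positive example (here, a non-mainstream form) receives a higher classifier score than a randomly chosen negative example (an inherited form), with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name and the scores are illustrative assumptions, not values or code from the paper.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Hypothetical helper for illustration only.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores (made up): three non-mainstream forms, two inherited forms.
toy_auc = pairwise_auc([0.9, 0.6, 0.4], [0.5, 0.3])
```

Under this reading, 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.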
Abstract
Basic vocabulary in many Sulawesi Austronesian languages includes forms that resist reconstruction to any proto-form and show phonological patterns inconsistent with inherited roots. Whether this non-conforming vocabulary represents a pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian…
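As a rough illustration of what phonological features of this kind might look like, the sketch below computes three of the properties named in the abstract (form length, consonant clusters, glottal stop rate) from a transcribed word. It is a minimal sketch under stated assumptions, not the authors' feature set: the function name, the five-vowel inventory, the glottal stop symbols, and the one-character-per-segment simplification (so digraphs such as "ng" would be miscounted as clusters) are all hypothetical.

```python
VOWELS = set("aeiou")          # assumed vowel inventory
GLOTTAL_STOPS = {"ʔ", "'"}     # assumed transcriptions of the glottal stop


def phonological_features(form):
    """Return a few simple phonological features for one transcribed form.

    Each character is treated as one segment (a simplification).
    """
    segments = [c for c in form.lower() if c.isalpha() or c in GLOTTAL_STOPS]
    length = len(segments)
    glottals = sum(1 for c in segments if c in GLOTTAL_STOPS)

    # Count maximal runs of two or more adjacent consonants.
    clusters = 0
    run = 0
    for c in segments:
        if c in VOWELS:
            run = 0
        else:
            run += 1
            if run == 2:  # count each run once, when it reaches length 2
                clusters += 1

    return {
        "length": length,
        "consonant_clusters": clusters,
        "glottal_stop_rate": glottals / length if length else 0.0,
    }


# Toy strings for illustration only, not data from the paper.
plain = phonological_features("mata")
marked = phonological_features("binta'")
```

Feature dictionaries like these could be stacked into a matrix and passed to a gradient-boosted classifier; the remaining features and the training setup are not specified in the text above.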
