Self-Supervised Borrowing Detection on Multilingual Wordlists
Tim Wientzek

TL;DR
This paper introduces a fully self-supervised method for detecting borrowings in multilingual wordlists. It combines PMI similarities with a contrastive component trained on phonetic features, and it performs on par with or better than supervised baselines without needing labeled data.
Contribution
It presents a self-supervised approach to borrowing detection that integrates PMI similarities from a global correspondence model with contrastive learning on phonetic feature vectors.
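To make the PMI idea concrete, here is a minimal sketch of pointwise mutual information over aligned sound pairs. It assumes a list of already-aligned segment pairs as input; the function name `pmi_table` and the toy data are hypothetical, and the paper's global correspondence model is not specified in this summary, so this is not a reconstruction of the author's procedure.

```python
import math
from collections import Counter


def pmi_table(aligned_pairs):
    """Return PMI scores for each observed (left, right) segment pair.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as relative frequencies over the aligned pairs.
    """
    n = len(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    left_counts = Counter(a for a, _ in aligned_pairs)
    right_counts = Counter(b for _, b in aligned_pairs)
    return {
        (a, b): math.log((c / n) / ((left_counts[a] / n) * (right_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }


# Toy, made-up alignments (hypothetical data, for illustration only).
pairs = [("p", "f"), ("p", "f"), ("t", "f"), ("t", "t")]
table = pmi_table(pairs)
```

A positive score means the two segments co-occur in alignments more often than chance would predict; a negative score means less often. A word-pair similarity could then be built by summing such scores along an alignment, which is again an assumption rather than a detail taken from the paper.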
Findings
PMI improves over string similarity measures like NED and SCA
Combined similarity matches or exceeds supervised baselines
The method scales to datasets of different sizes and operates without manual supervision
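For reference, one of the string-similarity baselines named in the findings can be sketched in a few lines. This takes NED to mean normalized edit distance, as is common in this literature; the normalization by the longer sequence length is an assumption, since the summary does not define it, and the function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or segment lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned(a, b):
    """Edit distance divided by the length of the longer sequence (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))
```

Unlike a PMI-based score, this measure treats every substitution as equally costly, so it cannot tell a regular sound correspondence from an arbitrary mismatch.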
Abstract
This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct…
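The role of the temperature setting in a contrastive component can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions: the summary does not say which contrastive objective the paper uses, and the function names, the cosine similarity, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy binary phonetic feature vectors (made up for illustration).
anchor = [1.0, 0.0, 1.0]
augmented = [1.0, 0.0, 1.0]   # stands in for an augmented view of the anchor
unrelated = [0.0, 1.0, 0.0]

loss_sharp = info_nce(anchor, augmented, [unrelated], temperature=0.1)
loss_soft = info_nce(anchor, augmented, [unrelated], temperature=1.0)
```

A lower temperature sharpens the softmax, so a well-separated positive yields a loss closer to zero, while a higher temperature keeps more weight on the negatives.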
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling
