Self-Supervised Borrowing Detection on Multilingual Wordlists

Tim Wientzek

arXiv:2512.01713 · cs.CL · December 2, 2025

TL;DR

This paper introduces a fully self-supervised method for detecting borrowings in multilingual wordlists. It combines PMI similarities with a contrastive component trained on phonetic features, and it performs on par with or better than supervised baselines without needing labeled data.

Contribution

The paper presents a self-supervised approach to borrowing detection that integrates PMI similarities from a global correspondence model with contrastive learning on phonetic feature vectors.

Findings

01 PMI improves over string similarity measures like NED and SCA

02 Combined similarity matches or exceeds supervised baselines

03 Method scales and operates without manual supervision
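
For readers unfamiliar with the NED baseline named in the first finding, the following is a minimal sketch of normalized edit distance, assuming the common definition (Levenshtein distance divided by the length of the longer sequence). The function names `edit_distance` and `ned` are illustrative and are not taken from the paper or its command-line tool.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of segments."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned(a, b):
    """Normalized edit distance in [0, 1]; 0 means identical sequences.

    Assumption: normalization by the longer sequence length.
    """
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


# Words are given as lists of sound segments (hypothetical example forms).
print(ned(["t", "a", "b", "l", "e"], ["t", "a", "f", "e", "l"]))
```

A measure like this treats every segment mismatch as equally costly, which is one reason a correspondence-aware score such as PMI can be more informative.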

Abstract

This paper presents a fully self-supervised approach to borrowing detection in multilingual wordlists. The method combines two sources of information: PMI similarities based on a global correspondence model and a lightweight contrastive component trained on phonetic feature vectors. It further includes an automatic procedure for selecting decision thresholds without requiring labeled data. Experiments on benchmark datasets show that PMI alone already improves over existing string similarity measures such as NED and SCA, and that the combined similarity performs on par with or better than supervised baselines. An ablation study highlights the importance of character encoding, temperature settings and augmentation strategies. The approach scales to datasets of different sizes, works without manual supervision and is provided with a command-line tool that allows researchers to conduct…
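
To make the PMI component more concrete, here is a minimal sketch of the generic pointwise mutual information definition, PMI(a, b) = log(p(a, b) / (p(a) p(b))), estimated from counts of aligned sound pairs. This is only the textbook formula under the assumption that aligned segment pairs are already available; it is not the paper's global correspondence model, and `pmi_scores` is a hypothetical helper name.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI for each observed (a, b) segment pair.

    aligned_pairs: list of (segment_in_language_1, segment_in_language_2)
    tuples, assumed to come from some prior alignment step.
    """
    n = len(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    left_counts = Counter(a for a, _ in aligned_pairs)
    right_counts = Counter(b for _, b in aligned_pairs)
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        # Positive PMI: the pair co-occurs more often than chance predicts.
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Hypothetical toy alignment data.
pairs = [("p", "p"), ("p", "p"), ("t", "t"), ("t", "t")]
for pair, score in pmi_scores(pairs).items():
    print(pair, round(score, 3))
```

Pairs that recur systematically across a wordlist receive high scores, so a word-level similarity built from such scores can reward regular correspondences in a way that plain string measures cannot.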

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Topic Modeling