Feature-Refined Unsupervised Model for Loanword Detection

Promise Dodzi Kpoglu

arXiv:2508.17923·cs.CL·August 26, 2025

TL;DR

This paper introduces an unsupervised, language-internal method for detecting loanwords that outperforms existing approaches across multiple Indo-European languages by iteratively refining linguistic feature analysis.

Contribution

The paper presents a novel unsupervised model that relies solely on internal linguistic features for loanword detection, avoiding external language data and improving cross-linguistic accuracy.

Findings

01. Outperforms baseline methods in loanword detection

02. Effective across six Indo-European languages

03. Shows strong scalability to multilingual datasets

Abstract

We propose an unsupervised method for detecting loanwords, i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. We extract pertinent linguistic features, score them, and map them probabilistically, then iteratively refine the initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German,…
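The extract, score, map, and refine loop described in the abstract can be illustrated with a toy sketch. This is not the paper's model. It assumes character bigrams as the only feature, average surprisal as the score, a logistic function of the z-score as the probabilistic mapping, and re-weighting of the "native" model by the current probabilities as the refinement step. All function names and parameters (`detect_loanwords`, `alpha`, `tol`, `max_iter`) are hypothetical.

```python
import math
from collections import Counter


def bigrams(word):
    """Character bigrams with '#' as a word-boundary marker (assumed feature set)."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def fit_native_model(words, weights):
    """Weighted bigram counts; words believed to be native contribute more."""
    counts = Counter()
    for word, weight in zip(words, weights):
        for bg in bigrams(word):
            counts[bg] += weight
    return counts


def surprisal(word, counts, total, vocab_size, alpha=0.1):
    """Average negative log-probability of a word's bigrams (add-alpha smoothing)."""
    bgs = bigrams(word)
    denom = total + alpha * vocab_size
    return -sum(math.log((counts[bg] + alpha) / denom) for bg in bgs) / len(bgs)


def detect_loanwords(words, max_iter=50, tol=1e-4):
    """Iteratively re-estimate P(loan) for each word until the values stabilise."""
    vocab_size = len({bg for w in words for bg in bigrams(w)})
    p_loan = [0.5] * len(words)  # uninformative starting point
    for _ in range(max_iter):
        # Refit the "native" model, down-weighting likely loanwords.
        counts = fit_native_model(words, [1.0 - p for p in p_loan])
        total = sum(counts.values())
        scores = [surprisal(w, counts, total, vocab_size) for w in words]
        # Map scores to probabilities via a logistic of the z-score.
        mean = sum(scores) / len(scores)
        std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
        new_p = [1.0 / (1.0 + math.exp(-(s - mean) / std)) for s in scores]
        delta = max(abs(a - b) for a, b in zip(p_loan, new_p))
        p_loan = new_p
        if delta < tol:  # convergence check
            break
    return dict(zip(words, p_loan))


if __name__ == "__main__":
    wordlist = ["water", "winter", "wonder", "under", "ladder",
                "letter", "better", "butter", "tsunami"]
    for word, p in sorted(detect_loanwords(wordlist).items(), key=lambda kv: -kv[1]):
        print(f"{word:10s} {p:.2f}")
```

In this toy wordlist, "tsunami" shares almost no bigrams with the other entries, so it ends up ranked as the most loanword-like. On a real wordlist, a single phonotactic cue like this would be far too weak on its own.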
