Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching

Stephen Gadd

arXiv:2601.06932 · cs.CL · March 31, 2026

TL;DR

Symphonym is a neural embedding system that maps toponyms from twenty writing systems into a unified phonetic space, enabling cross-script name matching without language-specific resources.

Contribution

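Symphonym introduces a Teacher-Student knowledge distillation architecture for cross-script name matching: a Teacher learns from articulatory phonetic features derived from IPA transcriptions, and a character-level Student inherits that knowledge, so no phonetic resources are needed at inference time.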

Findings

01 Achieves 85.2% Recall@1 on the MEHDIE benchmark.

02 Demonstrates cross-temporal generalisation to historical sources.

03 Outperforms previous methods in cross-script toponym matching.
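For readers unfamiliar with the metric in the first finding, Recall@1 is the share of queries whose top-ranked candidate is the correct match. The sketch below illustrates that definition on synthetic data only; the function names, the toy vectors, and the use of cosine similarity for ranking are assumptions made for illustration and are not taken from the paper or from the MEHDIE benchmark.

```python
import math
import random

DIM = 128  # embedding width stated in the abstract


def normalise(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_1(queries, candidates, gold):
    """Fraction of queries whose most similar candidate is the gold match."""
    hits = 0
    for q_idx, query in enumerate(queries):
        best = max(range(len(candidates)),
                   key=lambda i: cosine(query, candidates[i]))
        if best == gold[q_idx]:
            hits += 1
    return hits / len(queries)


# Synthetic stand-ins for name embeddings (hypothetical data, not model output).
rng = random.Random(0)
candidates = [normalise([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(50)]
gold = [3, 17, 42]
# Each query is a lightly perturbed copy of its gold candidate.
queries = [normalise([c + rng.gauss(0, 0.01) for c in candidates[g]]) for g in gold]

score = recall_at_1(queries, candidates, gold)
print(f"Recall@1 = {score:.2f}")
```

On real data, the queries and candidates would be embeddings of names in different scripts, and the gold list would come from a benchmark's annotated matches.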

Abstract

Matching place names across writing systems is a persistent obstacle to the integration of multilingual geographic sources, whether modern gazetteers, medieval itineraries, or colonial-era surveys. Existing approaches depend on language-specific phonetic algorithms or romanisation steps that discard phonetic information, and none generalises across script boundaries. This paper presents Symphonym, a neural embedding system which maps toponyms from twenty writing systems into a unified 128-dimensional phonetic space, enabling direct cross-script similarity comparison without language identification or phonetic resources at inference time. A Teacher-Student knowledge distillation architecture first learns from articulatory phonetic features derived from IPA transcriptions, then transfers this knowledge to a character-level Student model. Trained on 32.7 million triplet samples drawn from…
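The abstract mentions training on triplet samples but, in this truncated form, does not state the loss function. As a rough illustration of how triplets are commonly used, the sketch below implements a standard margin-based triplet loss over cosine distance. The margin value, the distance choice, and the toy two-dimensional vectors are all assumptions, not details confirmed by the paper.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 for identical directions, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Penalise triplets where the positive is not closer than the negative by `margin`.

    The margin of 0.2 is an arbitrary illustrative value.
    """
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D vectors standing in for 128-dimensional name embeddings (hypothetical).
anchor = [1.0, 0.0]     # e.g. a name in one script
positive = [0.9, 0.1]   # the same place written in another script
negative = [0.0, 1.0]   # an unrelated place

easy = triplet_margin_loss(anchor, positive, negative)  # already well separated
hard = triplet_margin_loss(anchor, negative, positive)  # roles swapped: violated

print(easy, hard)
```

A well-separated triplet contributes zero loss, while a violated one contributes a positive value that grows with the size of the violation, which is what pushes matching names together and non-matching names apart during training.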

Code & Models

Hugging Face model: docuracy/symphonym-v7 (3 downloads)
