Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Jiaming Luo; Frederik Hartmann; Enrico Santus; Yuan Cao; Regina Barzilay

arXiv:2010.11054·cs.CL·October 22, 2020·1 cites

TL;DR

This paper introduces a phonetic-aware model for deciphering ancient scripts that are not segmented into words and lack known related languages, leveraging linguistic constraints and IPA-based embeddings.

Contribution

It presents a novel generative framework that jointly models word segmentation and cognate alignment using phonological constraints and IPA-based character embeddings.
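As a rough illustration of what "jointly modeling word segmentation and cognate alignment" can mean, the sketch below segments an unsegmented string and matches each span to a known-language word in a single dynamic program. It is not the paper's model: plain edit distance stands in for the learned, phonetically informed alignment, and the names `segment`, `unmatched_cost`, and `max_len`, along with the toy vocabulary, are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def segment(text, vocab, unmatched_cost=1.5, max_len=8):
    """Jointly pick span boundaries and a vocabulary match for each span.

    Each span is either aligned to a known word (cost = edit distance)
    or a single character is left unmatched (cost = unmatched_cost).
    Both cost choices are illustrative assumptions, not the paper's objective.
    """
    n = len(text)
    best = [0.0] + [float("inf")] * n   # best[i]: min cost of text[:i]
    back = [None] * (n + 1)             # back[i]: (span start, matched word)
    for end in range(1, n + 1):
        # Option 1: leave the last character unmatched.
        best[end] = best[end - 1] + unmatched_cost
        back[end] = (end - 1, None)
        # Option 2: align a span ending here to some known word.
        for start in range(max(0, end - max_len), end):
            span = text[start:end]
            for word in vocab:
                cost = best[start] + edit_distance(span, word)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, word)
    # Recover the chosen spans by following back-pointers.
    spans, end = [], n
    while end > 0:
        start, word = back[end]
        spans.append((text[start:end], word))
        end = start
    return spans[::-1], best[n]


# Toy input: two words written without a separator.
spans, cost = segment("sunusdags", ["sunus", "dags"])
print(spans, cost)
```

The point of the design is that boundaries and matches are scored together, so a good match for one span can justify a boundary that would look arbitrary in isolation.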

Findings

1. Incorporating phonetic geometry improves decipherment accuracy on Gothic and Ugaritic.

2. Proposes a measure for language closeness that aligns with scholarly consensus.

3. Evaluates the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian).

Abstract

Most undeciphered lost languages exhibit two characteristics that pose significant decipherment challenges: (1) the scripts are not fully segmented into words; (2) the closest known language is not determined. We propose a decipherment model that handles both of these challenges by building on rich linguistic constraints reflecting consistent patterns in historical sound change. We capture the natural phonological geometry by learning character embeddings based on the International Phonetic Alphabet (IPA). The resulting generative framework jointly models word segmentation and cognate alignment, informed by phonological constraints. We evaluate the model on both deciphered languages (Gothic, Ugaritic) and an undeciphered one (Iberian). The experiments show that incorporating phonetic geometry leads to clear and consistent gains. Additionally, we propose a measure for language closeness…
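To make the idea of "phonological geometry" concrete, here is a minimal sketch in which each IPA symbol is represented by its articulatory features, so that sounds sharing more features sit closer together. The feature inventory is a small hand-picked assumption, and the vectors are fixed multi-hot encodings; the paper learns its embeddings, which this example does not attempt to reproduce.

```python
# Hypothetical, hand-picked feature table: (voicing, place, manner).
FEATURES = {
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
    "t": ("voiceless", "alveolar", "plosive"),
    "d": ("voiced", "alveolar", "plosive"),
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced", "alveolar", "fricative"),
    "m": ("voiced", "bilabial", "nasal"),
}


def build_vocab(features):
    """Map every distinct feature value to a vector index."""
    values = sorted({v for feats in features.values() for v in feats})
    return {v: i for i, v in enumerate(values)}


def embed(char, features, vocab):
    """Multi-hot vector with a 1.0 for each feature value of `char`."""
    vec = [0.0] * len(vocab)
    for value in features[char]:
        vec[vocab[value]] = 1.0
    return vec


def sq_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


VOCAB = build_vocab(FEATURES)
EMB = {c: embed(c, FEATURES, VOCAB) for c in FEATURES}

# /b/ and /p/ differ only in voicing; /b/ and /s/ differ in all three features.
print(sq_distance(EMB["b"], EMB["p"]))  # 2.0
print(sq_distance(EMB["b"], EMB["s"]))  # 6.0
```

Under this encoding, a substitution such as /b/ to /p/ is "cheaper" than /b/ to /s/, which is the kind of regularity in sound change that a phonetic prior is meant to capture.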

Code & Models

Repositories

j-luo93/xib
PyTorch · Official

Datasets

Nacryos/ancient-scripts-datasets
dataset · 1.3k downloads

Taxonomy

Topics: Natural Language Processing Techniques · Phonetics and Phonology Research · Speech Recognition and Synthesis