Text normalization for low-resource languages: the case of Ligurian

Stefano Lusito; Edoardo Ferrante; Jean Maillard

arXiv:2206.07861 · cs.CL · December 25, 2023 · 1 citation

TL;DR

This paper examines text normalization for Ligurian, an endangered Romance language with little available data, and shows that a compact transformer-based model can reach very low error rates when trained with backtranslation and appropriate tokenization.

Contribution

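It collects 4,394 Ligurian sentences paired with their normalized versions, releases the first open source monolingual corpus for Ligurian, and shows that a neural model can achieve very low error rates in a setting where hand-crafted rules have been perceived as more data efficient.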

Findings

01 A compact transformer-based model achieves very low error rates despite limited data

02 Backtranslation improves normalization performance

03 First open source monolingual corpus for Ligurian created
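
The backtranslation finding can be illustrated with a minimal sketch. It assumes the generic recipe: a backward model maps normalized text to unnormalized text, and its outputs on monolingual data are paired with the originals as synthetic training examples. The names `augment_with_backtranslation` and `toy_backward_model` are hypothetical, the toy model is a placeholder string substitution on English words rather than a trained system, and the sketch assumes the monolingual text is already in normalized spelling. None of this is taken from the paper's implementation.

```python
from typing import Callable, Iterable, List, Tuple

# A training pair is (unnormalized source, normalized target).
Pair = Tuple[str, str]


def augment_with_backtranslation(
    real_pairs: Iterable[Pair],
    normalized_monolingual: Iterable[str],
    backward_model: Callable[[str], str],
) -> List[Pair]:
    """Return the real pairs followed by synthetic pairs.

    Each synthetic pair is built by running the backward model
    (normalized -> unnormalized) on a monolingual sentence and using
    the original sentence as the target.
    """
    synthetic = [(backward_model(target), target) for target in normalized_monolingual]
    return list(real_pairs) + synthetic


def toy_backward_model(sentence: str) -> str:
    # Hypothetical stand-in for a trained backward model: it only
    # rewrites "ph" as "f" so the example runs without any training.
    return sentence.replace("ph", "f")


if __name__ == "__main__":
    pairs = augment_with_backtranslation(
        real_pairs=[("fone", "phone")],
        normalized_monolingual=["photo"],
        backward_model=toy_backward_model,
    )
    for source, target in pairs:
        print(f"{source} -> {target}")
```

In practice the backward model would itself be trained on the paired sentences, and the forward normalizer would then be trained on the combined real and synthetic pairs.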

Abstract

Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions or that have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
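
The abstract reports "error rates" without saying which metric is meant. As a minimal sketch, assuming a character error rate (total Levenshtein edits divided by total reference characters), a normalizer's output could be scored as follows. The function names are illustrative and the metric choice is an assumption, not a detail confirmed by the abstract.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Here `references` would hold the gold normalized sentences and `hypotheses` the model outputs, in the same order.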

Code & Models

Repositories

fleanend/fairseq-text-normalizer (official)

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling