Text normalization for low-resource languages: the case of Ligurian
Stefano Lusito, Edoardo Ferrante, Jean Maillard

TL;DR
This paper explores text normalization for Ligurian, an endangered, low-resource Romance language. It shows that a compact transformer-based model can reach very low error rates despite limited data, using backtranslation and appropriate tokenization.
Contribution
It introduces the first open source monolingual corpus for Ligurian, along with 4,394 Ligurian sentences paired with their normalized versions, and shows that neural methods outperform rule-based approaches in low-resource language normalization.
Findings
Transformer model achieves low error rates with limited data
Backtranslation improves normalization performance
First open source monolingual Ligurian corpus created
Abstract
Text normalization is a crucial technology for low-resource languages which lack rigid spelling conventions or that have undergone multiple spelling reforms. Low-resource text normalization has so far relied upon hand-crafted rules, which are perceived to be more data efficient than neural methods. In this paper we examine the case of text normalization for Ligurian, an endangered Romance language. We collect 4,394 Ligurian sentences paired with their normalized versions, as well as the first open source monolingual corpus for Ligurian. We show that, in spite of the small amounts of data available, a compact transformer-based model can be trained to achieve very low error rates by the use of backtranslation and appropriate tokenization.
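The backtranslation step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes a reverse model that maps normalized text back to unnormalized spellings, which is then applied to monolingual normalized text to produce synthetic training pairs. The names `toy_denormalizer`, `backtranslate`, and `build_training_set` are hypothetical. The toy function only strips diacritics at random as a stand-in for a trained reverse model, and the example strings are placeholders rather than Ligurian data.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A training pair: (unnormalized source, normalized target).
Pair = Tuple[str, str]


def toy_denormalizer(sentence: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a trained normalized -> unnormalized model.

    It randomly replaces some accented characters with plain ones to mimic
    spelling variation. A real setup would use a learned reverse model.
    """
    variants = {"à": "a", "è": "e", "ò": "o", "ù": "u", "ç": "c"}
    return "".join(
        variants[ch] if ch in variants and rng.random() < 0.5 else ch
        for ch in sentence
    )


def backtranslate(
    monolingual: Sequence[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Create synthetic pairs from normalized monolingual sentences.

    The synthetic (noisy) text becomes the source; the original normalized
    sentence stays as the target, so the targets remain clean.
    """
    return [(reverse_model(sentence), sentence) for sentence in monolingual]


def build_training_set(
    gold_pairs: Sequence[Pair],
    monolingual: Sequence[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Concatenate hand-normalized pairs with backtranslated synthetic pairs."""
    return list(gold_pairs) + backtranslate(monolingual, reverse_model)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder strings for illustration only (not real Ligurian data).
    gold = [("citta", "città")]
    mono = ["perché è così", "già là"]
    data = build_training_set(gold, mono, lambda s: toy_denormalizer(s, rng))
    for source, target in data:
        print(f"{source!r} -> {target!r}")
```

The point of the design is that the monolingual corpus only needs to be in normalized spelling. The reverse model supplies the noisy side, so the normalizer always trains against clean targets.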
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
