Neural Text Normalization for Luxembourgish using Real-Life Variation Data
Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank

TL;DR
This paper presents the first sequence-to-sequence models for Luxembourgish text normalization, leveraging real-life variation data and evaluating different architectures to address orthographic variation and data scarcity.
Contribution
Introduces novel sequence-to-sequence normalization models for Luxembourgish using ByT5 and mT5 architectures trained on real-life variation data, with comprehensive evaluation.
Findings
Sequence-to-sequence models effectively normalize Luxembourgish text.
Byte-based models outperform word-based models in certain scenarios.
Real-life variation data improves normalization accuracy.
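To make the byte-based versus word-based contrast concrete, here is a minimal sketch of what a byte-level model such as ByT5 receives as input: the UTF-8 bytes of the text. The helper name `utf8_bytes` and the example spellings are illustrative assumptions, not taken from the paper's data or code.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values of a string as a list of ints."""
    return list(text.encode("utf-8"))


# Illustrative spelling pair (an assumption, not from the paper's dataset):
# a form with the diacritic and a variant spelling without it.
standard = "Lëtzebuerg"
variant = "Letzebuerg"

standard_bytes = utf8_bytes(standard)
variant_bytes = utf8_bytes(variant)

# "ë" takes two bytes in UTF-8, so the two spellings differ in length
# at the byte level even though they have the same number of characters.
print(len(standard), len(standard_bytes))
print(len(variant), len(variant_bytes))
```

At this level, a spelling variant shows up as a small local change in the byte sequence. A word- or subword-based tokenizer may instead split the two spellings into quite different units.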
Abstract
Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
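One way word-level variation data could be turned into sentence-level training pairs for a sequence-to-sequence normalizer is sketched below. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `VARIANTS` lexicon, the `make_pair` helper, the substitution probability `p`, and the example words are all hypothetical.

```python
import random

# Hypothetical lexicon: standard form -> attested variant spellings.
# Entries are illustrative placeholders, not taken from the paper's data.
VARIANTS = {
    "Lëtzebuerg": ["Letzebuerg", "Lëtzebuerch"],
    "schéin": ["schein", "schein"],
}


def make_pair(standard_sentence, variants, rng, p=1.0):
    """Build one (noisy source, standard target) training pair.

    Each word that has known variants is replaced by a randomly chosen
    variant with probability p; all other words are copied unchanged.
    """
    noisy = []
    for word in standard_sentence.split():
        options = variants.get(word)
        if options and rng.random() < p:
            noisy.append(rng.choice(options))
        else:
            noisy.append(word)
    return " ".join(noisy), standard_sentence


rng = random.Random(0)  # fixed seed so the sketch is reproducible
source, target = make_pair("Lëtzebuerg ass schéin", VARIANTS, rng)
print(source, "->", target)
```

Pairs like `(source, target)` could then be used to fine-tune an encoder-decoder model, with the variant-bearing sentence as input and the standard sentence as the expected output.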
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Linear Layer · Dropout · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Residual Connection · Attention Dropout · Gated Linear Unit · Multi-Head Attention
