Neural Text Normalization for Luxembourgish using Real-Life Variation Data

Anne-Marie Lutgen; Alistair Plum; Christoph Purschke; Barbara Plank

arXiv:2412.09383·cs.CL·December 16, 2024

TL;DR

This paper presents the first sequence-to-sequence models for Luxembourgish text normalization, leveraging real-life variation data and evaluating different architectures to address orthographic variation and data scarcity.

Contribution

Introduces novel sequence-to-sequence normalization models for Luxembourgish using ByT5 and mT5 architectures trained on real-life variation data, with comprehensive evaluation.

Findings

01 Sequence models effectively normalize Luxembourgish text.

02 Byte-based models outperform word-based models in certain scenarios.

03 Real-life variation data improves normalization accuracy.

Abstract

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
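
As a rough illustration of why a byte-based model may suit orthographic variation, the sketch below encodes a standard spelling and a variant as byte-level ids. It assumes ByT5-style ids (each UTF-8 byte shifted by 3 to leave room for special tokens); the variant spelling and the helper names `byte_ids` and `common_suffix_len` are made up for this example and are not taken from the paper.

```python
def byte_ids(text: str, offset: int = 3) -> list[int]:
    """Map text to byte-level ids: each UTF-8 byte shifted by `offset`.

    The offset of 3 mirrors ByT5's reserved special tokens (assumption
    stated in the lead-in); no model or tokenizer library is loaded here.
    """
    return [b + offset for b in text.encode("utf-8")]


def common_suffix_len(a: list[int], b: list[int]) -> int:
    """Count how many trailing ids two sequences share."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


standard = "Lëtzebuergesch"
variant = "Letzebuergesch"  # hypothetical variant spelling, for illustration only

std_ids = byte_ids(standard)
var_ids = byte_ids(variant)

# "ë" takes two UTF-8 bytes, so the standard form is one id longer.
print(len(std_ids), len(var_ids))
# Most of the byte sequence is shared between the two spellings.
print(common_suffix_len(std_ids, var_ids))
```

At the byte level the two spellings overlap almost entirely, whereas a vocabulary that treats whole words as units would see two unrelated items. This is only a sketch of the intuition, not a description of the authors' pipeline.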

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Linear Layer · Dropout · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Residual Connection · Attention Dropout · Gated Linear Unit · Multi-Head Attention