Lexical Normalization for Code-switched Data and its Effect on POS-tagging
Rob van der Goot, Özlem Çetinoğlu

TL;DR
This paper introduces three lexical normalization models tailored for code-switched social media data, evaluates them on Indonesian-English and Turkish-German, and shows that normalization improves downstream POS tagging on the Turkish-German data.
Contribution
The paper presents normalization models designed specifically for code-switched data, adds normalization layers with corresponding language ID and POS tags to a Turkish-German dataset, and evaluates the downstream effect of normalization on POS tagging.
Findings
CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models.
Normalization leads to a 5.4% relative improvement in POS tagging.
The Turkish-German dataset gains new normalization layers with corresponding language ID and POS tags.
Abstract
Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform Id-En state of the art and Tr-De monolingual models, and lead to 5.4% relative…
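To make the task definition concrete, the following is a minimal sketch of word-level lexical normalization on a code-switched utterance. It is not one of the paper's three models: it assumes language IDs are already given per token and uses tiny hypothetical lookup tables (`NORM_LEXICONS`) invented purely for illustration.

```python
# Toy per-language lookup tables (hypothetical entries, for illustration only).
NORM_LEXICONS = {
    "id": {"yg": "yang"},
    "en": {"u": "you", "thx": "thanks"},
}


def normalize(tokens, lang_ids):
    """Replace each token with its canonical form from the lexicon of its language.

    Tokens without an entry (or with an unknown language ID) are left unchanged.
    """
    normalized = []
    for token, lang in zip(tokens, lang_ids):
        lexicon = NORM_LEXICONS.get(lang, {})
        normalized.append(lexicon.get(token.lower(), token))
    return normalized


# A mixed Indonesian-English token sequence with assumed gold language IDs.
tokens = ["thx", "yg", "udah", "help"]
lang_ids = ["en", "id", "id", "en"]
print(normalize(tokens, lang_ids))
```

The point of the sketch is only that the right replacement depends on the language of each token, which is why a monolingual normalizer can mishandle code-switched input.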
