MoNoise: Modeling Noise Using a Modular Normalization System
Rob van der Goot, Gertjan van Noord

TL;DR
MoNoise is a modular system for normalizing social media text to standard language. It is designed to be easily adaptable and is reported to outperform prior methods on English and Dutch benchmarks.
Contribution
The paper introduces MoNoise, a modular normalization model that generates candidates with independent modules and ranks them with a random forest classifier; it outperforms existing methods on multiple benchmarks.
Findings
MoNoise outperforms previous state-of-the-art normalization methods.
The modular candidate generation improves adaptability.
Features from generation modules and N-grams enhance ranking accuracy.
Abstract
We propose MoNoise: a normalization model focused on generalizability and efficiency; it aims at being easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case: from social media data to standard language. Our proposed model is based on a modular candidate generation in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove…
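The generate-then-rank design described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy vocabulary, lookup list, bigram counts, function names, and weights are all hypothetical, the word embeddings module is omitted, and a hand-weighted linear score stands in for the random forest classifier. It only shows how independent modules can each propose candidates and contribute features, with N-gram features added before ranking.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical toy resources; a real system would derive these from large corpora.
VOCAB = Counter({"you": 50, "your": 30, "tomorrow": 10, "see": 20, "are": 40, "great": 15})
LOOKUP = {"u": "you", "2moro": "tomorrow", "r": "are"}
BIGRAMS = Counter({("see", "you"): 8, ("you", "tomorrow"): 3, ("you", "are"): 6})

Features = Dict[str, float]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def original_module(word: str) -> Dict[str, Features]:
    # Keeping the word unchanged is always a candidate.
    return {word: {"is_original": 1.0}}


def lookup_module(word: str) -> Dict[str, Features]:
    # Static lookup list of known replacements.
    target = LOOKUP.get(word)
    return {target: {"in_lookup": 1.0}} if target else {}


def spelling_module(word: str, max_dist: int = 2) -> Dict[str, Features]:
    # Crude spelling correction: vocabulary words within a small edit distance.
    out: Dict[str, Features] = {}
    for v in VOCAB:
        d = edit_distance(word, v)
        if 0 < d <= max_dist:
            out[v] = {"spell_sim": 1.0 / (1 + d)}
    return out


MODULES = [original_module, lookup_module, spelling_module]


def generate(word: str) -> Dict[str, Features]:
    """Merge the candidates and features proposed by every module."""
    candidates: Dict[str, Features] = {}
    for module in MODULES:
        for cand, feats in module(word).items():
            candidates.setdefault(cand, {}).update(feats)
    return candidates


def add_ngram_features(prev_word: str, candidates: Dict[str, Features]) -> None:
    # N-gram features: unigram count and bigram count with the previous word.
    for cand, feats in candidates.items():
        feats["unigram"] = float(VOCAB[cand])
        feats["bigram"] = float(BIGRAMS[(prev_word, cand)])


# Hypothetical hand-set weights; the paper trains a random forest instead.
WEIGHTS = {"is_original": 0.5, "in_lookup": 3.0, "spell_sim": 1.0,
           "unigram": 0.01, "bigram": 0.2}


def rank(prev_word: str, word: str) -> List[str]:
    """Return candidates for `word`, best first, using the stand-in linear score."""
    candidates = generate(word)
    add_ngram_features(prev_word, candidates)

    def score(feats: Features) -> float:
        return sum(WEIGHTS[name] * value for name, value in feats.items())

    return sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)
```

In this sketch, adding a new normalization action means appending one function to `MODULES`; the ranking step needs only a weight (or, with a trained classifier, a feature column) for whatever feature the new module emits.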
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
