To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging
Rob van der Goot, Barbara Plank, Malvina Nissim

TL;DR
This paper investigates the impact of automatic normalization on POS tagging accuracy for noisy Twitter data, finding that normalization helps but adds no consistent gain beyond initializing the tagger's word embedding layer with embeddings learned from raw unlabeled data.
Contribution
It provides an empirical comparison of normalization versus raw data strategies for POS tagging on social media text, highlighting the limited additional benefit of normalization.
Findings
Normalization improves POS tagging accuracy but is not consistently better than raw data approaches.
Word embedding initialization alone yields competitive POS tagging performance.
Normalization's benefit diminishes when using large unlabeled datasets with embeddings.
Abstract
Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known about the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger.
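The two strategies compared in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the lookup lexicon, the function names `normalize` and `init_embeddings`, and the toy vectors are all assumptions made for illustration. The first function rewrites noisy tokens before tagging; the second leaves the text raw and instead seeds the tagger's embedding table with vectors that would, in practice, be learned from large unlabeled data.

```python
import random

# Hypothetical normalization lexicon (an assumption for this sketch);
# a real normalizer would generate and rank candidates automatically.
NORM_LEXICON = {"u": "you", "2morrow": "tomorrow", "gr8": "great"}


def normalize(tokens):
    """Strategy 1: replace each token by its canonical form if the lexicon has one."""
    return [NORM_LEXICON.get(tok.lower(), tok) for tok in tokens]


def init_embeddings(vocab, pretrained, dim, seed=0):
    """Strategy 2: keep the text raw, but initialize the embedding table.

    Words with a pretrained vector copy it; all other words get small
    random values, as they would without any pretraining.
    """
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table


if __name__ == "__main__":
    tweet = ["c", "u", "2morrow"]
    print(normalize(tweet))  # "c" is not in the toy lexicon, so it is left unchanged

    # Toy 2-dimensional "pretrained" vectors, purely illustrative.
    pretrained = {"u": [0.5, 0.5]}
    print(init_embeddings(tweet, pretrained, dim=2))
```

In the first strategy the tagger sees canonical words; in the second it sees the original tokens, and any knowledge about spellings such as "u" has to come from the embeddings themselves.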
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and Dialogue Systems
