To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

Rob van der Goot; Barbara Plank; Malvina Nissim

arXiv:1707.05116·cs.CL·July 18, 2017

TL;DR

This paper investigates the impact of automatic normalization on POS tagging accuracy for noisy Twitter data, finding that normalization helps but does not consistently add gains beyond initializing the word embedding layer from large amounts of raw, unlabeled data.

Contribution

It provides an empirical comparison of normalization versus raw data strategies for POS tagging on social media text, highlighting the limited additional benefit of normalization.

Findings

01 Normalization improves POS tagging accuracy but is not consistently better than raw data approaches.

02 Word embedding initialization alone yields competitive POS tagging performance.

03 Normalization's benefit diminishes when using large unlabeled datasets with embeddings.

Abstract

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known on the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger.
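To make "word embedding layer initialization" concrete, here is a minimal sketch of the general idea, not the authors' implementation. It assumes a lookup table of pretrained vectors (for example, trained on raw, unnormalized tweets) and copies them into the rows of a tagger's embedding matrix, falling back to small random values for words without a pretrained vector. The function name `init_embedding_matrix`, the toy vocabulary, and the vector values are all hypothetical.

```python
import numpy as np


def init_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return a (len(vocab), dim) matrix and the number of pretrained hits.

    Rows for words found in `pretrained` are copied from it; all other rows
    keep a small uniform random initialization.
    """
    rng = np.random.default_rng(seed)
    # Start from small random values so out-of-vocabulary words still get a row.
    matrix = rng.uniform(-0.1, 0.1, size=(len(vocab), dim))
    hits = 0
    for idx, word in enumerate(vocab):
        vec = pretrained.get(word)
        if vec is not None:
            matrix[idx] = vec  # overwrite the random row with the pretrained vector
            hits += 1
    return matrix, hits


# Hypothetical toy data: raw Twitter spellings ("u") sit next to canonical
# forms ("you") in the same embedding space, so no normalization step is needed.
pretrained = {
    "u": np.array([0.5, -0.2, 0.1]),
    "you": np.array([0.4, -0.1, 0.2]),
}
vocab = ["<unk>", "u", "you", "gr8"]
matrix, hits = init_embedding_matrix(vocab, pretrained, dim=3)
print(matrix.shape, hits)
```

The resulting matrix would then be used as the initial weights of the tagger's embedding layer, which training can fine-tune further.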

Code & Models

Repositories

bplank/wnut-2017-pos-norm (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems