reproducing "ner and pos when nothing is capitalized"

Andreas Kuster; Jakub Filipek; Viswa Virinchi Muppirala

arXiv:2109.08396·cs.CL·September 20, 2021

TL;DR

This paper reproduces a study showing that lowercasing half of the training data mitigates the performance drop in NER and POS tagging caused by casing mismatches between training and test data. It confirms the original claims, though with slightly lower scores.

Contribution

The authors successfully reproduce the original findings on casing effects in NLP tasks and provide a publicly available implementation for further research.

Findings

01 Lowercasing 50% of the data yields the best performance.

02 Reproduction results are slightly lower than those reported in the original paper.

03 The implementation is available in a public GitHub repository.

Abstract

Capitalization is an important feature in many NLP tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. We attempt to reproduce the results of a paper that shows how to mitigate the significant performance drop that occurs when casing is mismatched between training and test data. In particular, we show that lowercasing 50% of the dataset provides the best performance, matching the claims of the original paper. We also obtained slightly lower performance in almost all of the experiments we tried to reproduce, suggesting that hidden factors may be affecting our results. Lastly, we make all of our work available in a public GitHub repository.
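
The 50% lowercasing idea from the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lowercase_half`, the data layout (sentences as lists of (token, tag) pairs), the per-sentence random selection, and the fixed seed are all assumptions made for illustration.

```python
import random


def lowercase_half(sentences, fraction=0.5, seed=0):
    """Return a copy of `sentences` with a random `fraction` of them lowercased.

    Hypothetical helper: each sentence is a list of (token, tag) pairs.
    Only the tokens are lowercased; the tags are left untouched.
    """
    rng = random.Random(seed)
    n_lower = round(len(sentences) * fraction)
    # Pick which sentence indices to lowercase (assumed: per-sentence sampling).
    chosen = set(rng.sample(range(len(sentences)), n_lower))
    mixed = []
    for i, sentence in enumerate(sentences):
        if i in chosen:
            mixed.append([(token.lower(), tag) for token, tag in sentence])
        else:
            mixed.append(list(sentence))
    return mixed


# Toy NER-style data, invented for this example.
train = [
    [("Alice", "B-PER"), ("visited", "O"), ("Paris", "B-LOC")],
    [("Bob", "B-PER"), ("works", "O"), ("at", "O"), ("Google", "B-ORG")],
    [("Berlin", "B-LOC"), ("is", "O"), ("large", "O")],
    [("Carol", "B-PER"), ("met", "O"), ("Dave", "B-PER")],
]

mixed_train = lowercase_half(train, fraction=0.5, seed=0)
```

A model trained on `mixed_train` sees both cased and uncased variants of the input, which is the intuition behind making it less sensitive to the casing of the test data.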

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems