Training Dynamic based data filtering may not work for NLP datasets
Arka Talukdar, Monika Dagar, Prachi Gupta, Varun Menon

TL;DR
This paper investigates how well the Area Under the Margin (AUM) metric filters mislabeled data in NLP datasets. AUM removes some mislabeled examples, but it also discards many correctly labeled, relevant ones, which costs the model useful language information.
Contribution
The study evaluates the applicability of the AUM metric for mislabel detection in NLP datasets and highlights its limitations in preserving useful data.
Findings
AUM can identify some mislabeled samples in NLP datasets.
Filtering with AUM also removes many correctly labeled, relevant examples.
Models tend to rely on distributional information rather than syntactic or semantic cues.
Abstract
The recent increase in dataset size has brought about significant advances in natural language understanding. These large datasets are usually collected through automation (search engines or web crawlers) or crowdsourcing, which inherently introduces incorrectly labeled data. Training on these datasets leads to memorization and poor generalization. Thus, it is pertinent to develop techniques that help identify and isolate mislabeled data. In this paper, we study the applicability of the Area Under the Margin (AUM) metric to identify and remove or rectify mislabeled examples in NLP datasets. We find that mislabeled samples can be filtered using the AUM metric in NLP datasets, but doing so also removes a significant number of correctly labeled points and leads to the loss of a large amount of relevant language information. We show that models rely on the distributional…
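For readers unfamiliar with AUM, the sketch below illustrates the general idea behind the metric: for each training example, take the logit of the assigned label minus the largest logit among the other classes, and average that margin over training epochs. Examples with low or negative AUM are candidates for being mislabeled. This is a minimal illustration, not the authors' code. The function names, the toy logits, and the cutoff of 0.0 are all assumptions made for the example; in practice the cutoff would be chosen from the data rather than fixed by hand.

```python
from typing import List, Sequence


def margin(logits: Sequence[float], assigned_label: int) -> float:
    """Assigned-class logit minus the largest logit among the other classes."""
    others = [z for k, z in enumerate(logits) if k != assigned_label]
    return logits[assigned_label] - max(others)


def area_under_margin(logits_per_epoch: Sequence[Sequence[float]],
                      assigned_label: int) -> float:
    """Average the per-epoch margins recorded for one training example."""
    margins = [margin(epoch_logits, assigned_label)
               for epoch_logits in logits_per_epoch]
    return sum(margins) / len(margins)


def flag_low_aum(aum_scores: Sequence[float], threshold: float) -> List[int]:
    """Return indices of examples whose AUM falls below the threshold."""
    return [i for i, score in enumerate(aum_scores) if score < threshold]


# Toy logits over three epochs for two examples, both assigned label 0.
# (Hypothetical numbers, chosen only to show the contrast.)
clean = [[2.0, 0.5, 0.0], [3.0, 0.5, 0.0], [4.0, 1.0, 0.0]]  # model agrees with label 0
noisy = [[0.0, 2.0, 0.5], [0.5, 3.0, 0.0], [1.0, 4.0, 0.0]]  # model prefers class 1

scores = [area_under_margin(clean, 0), area_under_margin(noisy, 0)]
flagged = flag_low_aum(scores, threshold=0.0)  # 0.0 is an illustrative cutoff

print(scores)
print(flagged)
```

Note that a low AUM only says the model never became confident in the assigned label. A correctly labeled but rare or hard example can score low for the same reason, which is consistent with the paper's finding that AUM filtering discards correct, relevant data.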
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
