AlleNoise: large-scale text classification benchmark dataset with real-world label noise
Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko

TL;DR
AlleNoise is a large-scale, real-world noisy text classification dataset with over 500,000 examples, designed to benchmark and improve methods for learning with noisy labels in natural language processing.
Contribution
This paper introduces AlleNoise, a novel benchmark dataset with real-world label noise for text classification, addressing the lack of realistic noise datasets in NLP.
Findings
Established that existing noise-handling methods are inadequate for real-world noise
Demonstrated that current algorithms do not effectively reduce memorization of noisy labels
Provided a new benchmark for developing robust text classification models
Abstract
Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce…
Taxonomy
Topics: Text and Document Classification Technologies · Machine Learning and Data Classification
Methods: Sparse Evolutionary Training
