AlleNoise: large-scale text classification benchmark dataset with real-world label noise

Alicja Rączkowska; Aleksandra Osowska-Kurczab; Jacek Szczerbiński; Kalina Jasinska-Kobus; Klaudia Nazarko

arXiv:2407.10992·cs.CL·October 24, 2024

TL;DR

AlleNoise is a large-scale, real-world noisy text classification dataset with over 500,000 examples, designed to benchmark and improve methods for learning with noisy labels in natural language processing.

Contribution

This paper introduces AlleNoise, a novel benchmark dataset with real-world label noise for text classification, addressing the lack of realistic noise datasets in NLP.

Findings

01. Established that existing noise-handling methods are inadequate for real-world noise

02. Demonstrated that current algorithms do not effectively reduce memorization of noisy labels

03. Provided a new benchmark for developing robust text classification models

Abstract

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked primarily on datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce…

Code & Models

Repositories

allegro/allenoise (PyTorch, official)

Taxonomy

Topics: Text and Document Classification Technologies · Machine Learning and Data Classification

Methods: Sparse Evolutionary Training