An accurate detection is not all you need to combat label noise in web-noisy datasets

Paul Albert; Jack Valmadre; Eric Arazo; Tarun Krishna; Noel E. O'Connor; Kevin McGuinness

arXiv:2407.05528·cs.CV·July 9, 2024

TL;DR

This paper investigates the limitations of hyperplane-based out-of-distribution (OOD) detection on noisy web-crawled datasets and proposes a hybrid method that combines linear separation with small-loss techniques to improve classification accuracy.

Contribution

The paper shows that hyperplane-based OOD detection, although accurate, misses clean examples that are valuable for supervised learning, and it introduces a hybrid approach that improves noise robustness on web-crawled datasets.

Findings

01. Linear hyperplane detection accurately identifies OOD samples.

02. The hybrid method improves classification accuracy on noisy datasets.

03. Combining linear separation with state-of-the-art small-loss methods yields state-of-the-art results.
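The idea of combining the two detectors can be illustrated with a minimal sketch. It assumes each sample already has a signed score from a linear separator (positive meaning in-distribution) and a per-sample training loss. The function name `hybrid_clean_mask`, the thresholds, and the use of a simple union are illustrative assumptions; this summary does not state how the paper actually combines the two detections.

```python
def hybrid_clean_mask(linear_scores, losses, score_threshold=0.0, loss_quantile=0.5):
    """Keep a sample if either detector considers it clean (hypothetical union rule)."""
    # Detector 1 (assumed form): signed score from a linear separator,
    # where a score above the threshold means "in-distribution".
    linear_keep = [s > score_threshold for s in linear_scores]

    # Detector 2: small-loss heuristic, keeping samples whose loss is at or
    # below the chosen quantile of all losses.
    ranked = sorted(losses)
    cutoff = ranked[int(loss_quantile * (len(ranked) - 1))]
    loss_keep = [l <= cutoff for l in losses]

    # Union: a sample rejected by the hyperplane can still be rescued
    # by the small-loss detector.
    return [a or b for a, b in zip(linear_keep, loss_keep)]


# Toy values, made up for illustration only.
scores = [1.2, -0.3, 0.8, -1.5, -0.1]
losses = [0.2, 0.1, 0.9, 2.3, 1.7]
mask = hybrid_clean_mask(scores, losses)
print(mask)  # [True, True, True, False, False]
```

In this toy run the second sample is rejected by the linear score but has a small loss, so the union keeps it. That mirrors the kind of easy, clean example the paper reports a purely hyperplane-based detector can miss.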

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean examples that are valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean examples using standard loss- or…
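As a rough illustration of what estimating a separating hyperplane in feature space can look like, the sketch below fits a logistic-regression hyperplane on toy 2-D features with plain gradient descent. It assumes a small set of samples with known ID/OOD labels is available, which this truncated abstract does not state. The toy features stand in for contrastive embeddings, and `fit_hyperplane` and `signed_score` are hypothetical helper names, not the authors' code.

```python
import math


def fit_hyperplane(features, is_id, lr=0.5, epochs=200):
    """Fit weights w and bias b of a logistic-regression hyperplane by gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, is_id):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of "ID"
            err = p - y
            for j in range(dim):
                grad_w[j] += err * x[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def signed_score(x, w, b):
    """Positive on the ID side of the hyperplane, negative on the OOD side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy, linearly separable "features" (assumed; real inputs would be embeddings).
features = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0],
            [-1.0, -0.8], [-0.9, -1.2], [-1.1, -1.0]]
is_id = [1, 1, 1, 0, 0, 0]  # 1 = in-distribution, 0 = out-of-distribution

w, b = fit_hyperplane(features, is_id)
predicted_id = [signed_score(x, w, b) > 0 for x in features]
```

The signed score produced here is the kind of quantity a hyperplane-based OOD detector would threshold. Accurate separation on this score alone says nothing about which retained samples are most useful for training the classifier.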

Code & Models

Repositories

paulalbert31/lsa (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Contrastive Learning