You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling

Nabeel Seedat; Nicolas Huynh; Fergus Imrie; Mihaela van der Schaar

arXiv:2406.13733·cs.LG·June 21, 2024

TL;DR

This paper emphasizes the importance of data quality in pseudo-labeling for semi-supervised learning, introducing a novel framework called DIPS that enhances pseudo-labeling by analyzing learning dynamics and improving data selection.

Contribution

The paper presents DIPS, a new data characterization and selection framework that improves pseudo-labeling by considering data quality and learning dynamics, applicable across various datasets.

Findings

01. DIPS improves pseudo-labeling performance across tabular and image datasets.

02. DIPS enhances data efficiency and reduces performance gaps between pseudo-labelers.

03. Data quality analysis significantly benefits semi-supervised learning methods.

Abstract

Pseudo-labeling is a popular semi-supervised learning technique to leverage unlabeled data when labeled samples are scarce. The generation and selection of pseudo-labels heavily rely on labeled data. Existing approaches implicitly assume that the labeled data is gold standard and 'perfect'. However, this can be violated in reality with issues such as mislabeling or ambiguity. We address this overlooked aspect and show the importance of investigating labeled data quality to improve any pseudo-labeling method. Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling. We select useful labeled and pseudo-labeled samples via analysis of learning dynamics. We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world tabular and image datasets. Additionally, DIPS…
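The abstract says samples are selected "via analysis of learning dynamics" but does not give the scoring rule here. The sketch below is an illustration under stated assumptions, not the DIPS implementation: it assumes a model's predicted class probabilities are saved at several training checkpoints, scores each sample by its average probability on the assigned label (confidence) and by the average of p(1 - p) across checkpoints (a simple uncertainty proxy), and keeps samples that pass both thresholds. The function name, the argument layout, and the threshold values are hypothetical.

```python
from statistics import mean


def select_by_learning_dynamics(probs_per_checkpoint, labels,
                                conf_threshold=0.8, unc_threshold=0.2):
    """Return indices of samples whose assigned label looks reliable.

    probs_per_checkpoint[e][i][c] is the predicted probability of class c
    for sample i at training checkpoint e. labels[i] is the given label
    (or pseudo-label) of sample i. Thresholds are illustrative assumptions.
    """
    kept = []
    for i, y in enumerate(labels):
        # Probability assigned to the sample's own label at each checkpoint.
        p_label = [probs[i][y] for probs in probs_per_checkpoint]
        confidence = mean(p_label)
        # Mean Bernoulli variance p(1 - p): high when the model stays unsure.
        uncertainty = mean([p * (1.0 - p) for p in p_label])
        if confidence >= conf_threshold and uncertainty <= unc_threshold:
            kept.append(i)
    return kept


# Toy example: 3 checkpoints, 3 samples, 2 classes.
checkpoints = [
    [[0.70, 0.30], [0.60, 0.40], [0.50, 0.50]],
    [[0.90, 0.10], [0.50, 0.50], [0.70, 0.30]],
    [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]],
]
labels = [0, 1, 0]
selected = select_by_learning_dynamics(checkpoints, labels)
print(selected)  # [0]
```

In this toy run only sample 0 is kept: sample 1 never receives much probability on its assigned label, which is the pattern one would expect from a mislabeled example, and sample 2 is learned too slowly to clear the confidence threshold. The same filter could in principle be applied to both labeled and pseudo-labeled samples before each pseudo-labeling round.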

Code & Models

Repositories

seedatnabeel/dips
pytorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques