Intrinsic Self-Supervision for Data Quality Audits

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Ludovic Amruthalingam; Labelling Consortium; Matthew Groh; Alexander A. Navarini; Marc Pouly

arXiv:2305.17048 · cs.CV · October 30, 2024 · 2 cites

TL;DR

SelfClean introduces a self-supervised, distance-based approach for automated data quality auditing in image datasets, significantly reducing human effort and improving the accuracy of identifying dataset issues.

Contribution

The paper presents SelfClean, a novel self-supervised method combining context-aware representations and distance metrics for effective data cleaning in vision datasets.

Findings

01 Outperforms state-of-the-art in detecting dataset issues

02 Identifies up to 16% of problematic images

03 Improves evaluation reliability after cleaning

Abstract

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple…
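The ranking formulation in the abstract can be illustrated with a minimal sketch. It assumes image embeddings have already been produced by some self-supervised encoder (not shown here) and uses two simple cosine-distance indicators as stand-ins: the closest pairs as near-duplicate candidates, and samples far from their nearest neighbour as off-topic candidates. The function names and the toy data are hypothetical, and these indicators are not necessarily the ones SelfClean uses.

```python
import numpy as np


def cosine_distance_matrix(embeddings):
    """Pairwise cosine distances between row vectors, in [0, 2]."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    dist = np.clip(1.0 - unit @ unit.T, 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)
    return dist


def rank_near_duplicate_pairs(embeddings):
    """Rank all index pairs (i, j), i < j, from closest to farthest."""
    dist = cosine_distance_matrix(embeddings)
    i, j = np.triu_indices(len(embeddings), k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return [(int(i[k]), int(j[k]), float(dist[i[k], j[k]])) for k in order]


def rank_isolated_samples(embeddings):
    """Rank samples by distance to their nearest neighbour, most isolated first.

    Assumption: isolation is a crude proxy for "off-topic", used only for illustration.
    """
    dist = cosine_distance_matrix(embeddings)
    np.fill_diagonal(dist, np.inf)  # ignore self-distance
    nearest = dist.min(axis=1)
    order = np.argsort(-nearest, kind="stable")
    return [(int(k), float(nearest[k])) for k in order]


# Hypothetical toy embeddings: rows 0 and 1 are almost identical, row 3 is far from the rest.
toy = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [-1.0, -0.2]])
pairs = rank_near_duplicate_pairs(toy)
isolated = rank_isolated_samples(toy)
```

A reviewer would inspect the head of each ranking first, which is where the ranking view saves human effort. Applying a threshold to the same distances would correspond to the scoring view.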

Videos

Intrinsic Self-Supervision for Data Quality Audits · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · AI in cancer detection