Intrinsic Self-Supervision for Data Quality Audits
Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Labelling Consortium, Matthew Groh, Alexander A. Navarini, Marc Pouly

TL;DR
SelfClean introduces a self-supervised, distance-based approach for automated data quality auditing in image datasets, significantly reducing human effort and improving the accuracy of identifying dataset issues.
Contribution
The paper presents SelfClean, a novel self-supervised method combining context-aware representations and distance metrics for effective data cleaning in vision datasets.
Findings
Outperforms state-of-the-art in detecting dataset issues
Identifies up to 16% of problematic images
Improves evaluation reliability after cleaning
Abstract
Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple…
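The ranking formulation in the abstract can be illustrated with a minimal sketch. It assumes each image has already been mapped to an embedding vector by some self-supervised encoder, which is not shown here. The function names are hypothetical. Cosine distance and the nearest-neighbour isolation score are simple stand-ins chosen for illustration, not necessarily the indicators used by SelfClean.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_near_duplicates(embeddings):
    """Return (distance, i, j) for every pair, closest pairs first.

    Pairs at the top of the list are the most likely near duplicates,
    so a human only needs to inspect the head of the ranking.
    """
    pairs = [
        (cosine_distance(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    return sorted(pairs)


def rank_off_topic(embeddings):
    """Return (score, i) per sample, most isolated samples first.

    Assumption: the score is the distance to the nearest other sample,
    a simple proxy for "does not belong with the rest of the data".
    """
    n = len(embeddings)
    scores = []
    for i in range(n):
        nearest = min(
            cosine_distance(embeddings[i], embeddings[j])
            for j in range(n)
            if j != i
        )
        scores.append((nearest, i))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Toy 2-D "embeddings": samples 0 and 1 are almost identical.
    toy = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
    print(rank_near_duplicates(toy)[:3])
    print(rank_off_topic(toy)[:3])
```

Both functions return a full ranking rather than a yes/no decision, which matches the ranking view: inspection effort goes to the top of the list. The scoring view would instead apply a threshold to the same scores.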
Taxonomy
Topics: Machine Learning and Data Classification · Domain Adaptation and Few-Shot Learning · AI in cancer detection
