CleanPatrick: A Benchmark for Image Data Cleaning

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Ludovic Amruthalingam; Elisabeth Victoria Goessinger; Hanna Lindemann; Marie Bargiela; Marie Hofbauer; Omar Badri; Philipp Tschandl; Arash Koochek; Matthew Groh; Alexander A. Navarini; Marc Pouly

arXiv:2505.11034 · cs.CV · May 19, 2025

TL;DR

CleanPatrick is a large-scale, real-world benchmark for image data cleaning that enables systematic comparison of cleaning methods, highlighting the strengths of self-supervised representations and the challenges in label-error detection.

Contribution

The paper introduces the first comprehensive, real-world benchmark for image data cleaning, pairing a large annotated medical image dataset with an evaluation framework.

Findings

01. Self-supervised representations excel at near-duplicate detection.

02. Classical methods are effective for off-topic detection under limited review budgets.

03. Label-error detection remains a significant challenge in medical image classification.

Abstract

Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on…
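To make the ranking formulation concrete, the following is a minimal sketch of how a ranked list of suspected issues can be scored against binary ground truth. It assumes each cleaning method outputs one score per sample (higher means more likely to be problematic). The function names, the toy data, and the choice of precision@k, recall@k, and average precision are illustrative assumptions; the excerpt above does not specify which ranking metrics CleanPatrick reports.

```python
from typing import List, Sequence


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Return sample indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked samples that are true issues."""
    if k <= 0:
        raise ValueError("k must be positive")
    top = rank_by_score(scores)[:k]
    return sum(labels[i] for i in top) / k


def recall_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of all true issues found within the top-k ranked samples."""
    if k <= 0:
        raise ValueError("k must be positive")
    total = sum(labels)
    if total == 0:
        return 0.0
    top = rank_by_score(scores)[:k]
    return sum(labels[i] for i in top) / total


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values at each rank where a true issue appears."""
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(rank_by_score(scores), start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: 1 marks a confirmed issue, 0 a clean sample.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    toy_labels = [1, 0, 1, 0, 0]
    print(precision_at_k(toy_scores, toy_labels, k=2))
    print(recall_at_k(toy_scores, toy_labels, k=3))
    print(average_precision(toy_scores, toy_labels))
```

In an audit setting, k plays the role of the review budget: precision@k indicates how much of a reviewer's effort on the top-k samples is spent on real issues, while recall@k indicates how many of the issues that budget uncovers.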

Code & Models

Repositories

digital-dermatology/cleanpatrick (PyTorch, official)

Taxonomy

Topics: Machine Learning and Data Classification · Cell Image Analysis Techniques · Adversarial Robustness in Machine Learning