Data Leakage in Visual Datasets
Patrick Ramos, Ryan Ramos, Noa Garcia

TL;DR
This paper investigates data leakage in visual datasets and finds that every benchmark analyzed contains some form of leakage, which undermines the fairness and reliability of model evaluation.
Contribution
It systematically characterizes types of visual data leakage and demonstrates its presence across multiple datasets using image retrieval methods.
Findings
All datasets analyzed exhibit some form of leakage.
Leakage types range from severe to subtle, affecting evaluation integrity.
Leakage compromises the reliability of downstream model assessments.
Abstract
We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Given that large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We categorize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
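The abstract says leakage is found with image retrieval techniques but does not name the encoder, similarity measure, or threshold. The sketch below is one plausible reading, not the authors' pipeline: it assumes each image has already been turned into an embedding vector by some encoder, retrieves the nearest training image for each evaluation image by cosine similarity, and flags pairs above a cutoff. The function names, the toy vectors, and the 0.95 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_leak_candidates(train_embeddings, eval_embeddings, threshold=0.95):
    """Retrieve the nearest training image for each evaluation image.

    Both arguments map an image id to its embedding vector. A pair is
    reported only when the best similarity reaches `threshold`, which is
    an assumed cutoff and would need tuning for a real encoder.
    """
    candidates = []
    for eval_id, eval_vec in eval_embeddings.items():
        best_id, best_sim = None, -1.0
        for train_id, train_vec in train_embeddings.items():
            sim = cosine_similarity(eval_vec, train_vec)
            if sim > best_sim:
                best_id, best_sim = train_id, sim
        if best_id is not None and best_sim >= threshold:
            candidates.append((eval_id, best_id, best_sim))
    return candidates


# Toy embeddings standing in for encoder outputs (hypothetical values).
train = {"train_0": [1.0, 0.0, 0.0], "train_1": [0.0, 1.0, 0.0]}
evals = {"eval_0": [0.99, 0.01, 0.0], "eval_1": [0.0, 0.0, 1.0]}

leaks = find_leak_candidates(train, evals)
for eval_id, train_id, sim in leaks:
    print(f"{eval_id} closely matches {train_id} (cosine similarity {sim:.4f})")
```

In this toy run only eval_0 is flagged, because it is nearly parallel to train_0, while eval_1 is orthogonal to every training vector. The brute-force loop is quadratic in the number of images; at the scale of internet-sourced training sets an approximate nearest-neighbor index would likely be needed, and flagged pairs would still need manual inspection to decide the type and degree of leakage.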
