Data Leakage in Visual Datasets

Patrick Ramos; Ryan Ramos; Noa Garcia

arXiv:2508.17416 · cs.CV · August 26, 2025

TL;DR

This paper investigates data leakage in visual datasets and finds that every analyzed benchmark contains some form of leakage, which undermines the fairness and reliability of model evaluation.

Contribution

The paper systematically characterizes types of visual data leakage by modality, coverage, and degree, and uses image retrieval methods to demonstrate its presence across multiple datasets.

Findings

01 All datasets analyzed exhibit some form of leakage.

02 Leakage types range from severe to subtle, affecting evaluation integrity.

03 Leakage compromises the reliability of downstream model assessments.

Abstract

We analyze data leakage in visual datasets. Data leakage refers to images in evaluation benchmarks that have been seen during training, compromising fair model evaluation. Because large-scale datasets are often sourced from the internet, where many computer vision benchmarks are publicly available, we focus on identifying and studying this phenomenon. We categorize visual leakage into different types according to its modality, coverage, and degree. By applying image retrieval techniques, we unequivocally show that all the analyzed datasets present some form of leakage, and that all types of leakage, from severe instances to more subtle cases, compromise the reliability of model evaluation in downstream tasks.
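
The abstract says leakage is detected by applying image retrieval techniques but does not spell out the pipeline. The following is a minimal sketch of one common retrieval-style check, not the authors' method: each evaluation image is matched to its nearest training image by cosine similarity over precomputed embeddings, and pairs above a threshold are flagged as leak candidates. The function names, the toy 3-dimensional embeddings, and the 0.95 threshold are all assumptions made for illustration; in practice the embeddings would come from an image encoder and the threshold would need tuning against manually inspected pairs.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; an all-zero vector is returned as zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def find_leak_candidates(
    train_embs: Sequence[Sequence[float]],
    eval_embs: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[int, int, float]]:
    """Return (eval_index, train_index, similarity) for each evaluation
    embedding whose nearest training embedding reaches the threshold."""
    train_unit = [l2_normalize(v) for v in train_embs]
    candidates = []
    for eval_idx, emb in enumerate(eval_embs):
        query = l2_normalize(emb)
        best_idx, best_sim = -1, -1.0
        # Brute-force nearest-neighbour search over the training set.
        for train_idx, train_vec in enumerate(train_unit):
            sim = dot(query, train_vec)
            if sim > best_sim:
                best_idx, best_sim = train_idx, sim
        if best_sim >= threshold:
            candidates.append((eval_idx, best_idx, best_sim))
    return candidates


# Toy embeddings (assumed): the first evaluation vector nearly duplicates
# the first training vector; the second is orthogonal to both training vectors.
train = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
evals = [[0.99, 0.05, 0.0], [0.0, 0.0, 1.0]]

leaks = find_leak_candidates(train, evals)
leak_rate = len(leaks) / len(evals)  # fraction of evaluation items flagged
```

The brute-force loop is quadratic in dataset size, so a check at the scale of internet-sourced training sets would normally swap it for an approximate nearest-neighbour index. A single similarity threshold also captures only near-duplicate images; the subtler cases the abstract mentions would likely need additional criteria.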
