AI slipping on tiles: data leakage in digital pathology
Nicole Bussola, Alessia Marcolini, Valerio Maggio, Giuseppe Jurman, Cesare Furlanello

TL;DR
This paper investigates how data leakage in digital pathology AI models can inflate performance metrics, demonstrates its impact through experiments, and proposes an automated pipeline to prevent leakage, enhancing reproducibility and reliability.
Contribution
It highlights the overlooked issue of data leakage in digital pathology, quantifies its effect on model performance, and introduces histolab, a Python package to create leakage-free deep learning pipelines.
Findings
Data leakage can inflate predictive scores by up to 41%.
Proper data partitioning is crucial to avoid leakage in histology data.
The proposed pipeline effectively prevents leakage on public datasets.
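The partitioning point above can be illustrated with a minimal sketch of a subject-level split, in which all tiles from one patient land on the same side of the train/test boundary. This is not the authors' pipeline and does not use the histolab API; the function name `split_by_patient`, the tile ID format, and the toy data are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def split_by_patient(tiles, test_fraction=0.2, seed=0):
    """Split tiles into train/test so that no patient appears in both.

    `tiles` is a list of (tile_id, patient_id) pairs (hypothetical format).
    Returns (train_tile_ids, test_tile_ids, test_patient_ids).
    """
    # Group tile IDs under the patient they were extracted from.
    by_patient = defaultdict(list)
    for tile_id, patient_id in tiles:
        by_patient[patient_id].append(tile_id)

    # Shuffle patients, not tiles: the patient is the unit of partitioning.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [t for p in patients if p not in test_patients for t in by_patient[p]]
    test = [t for p in patients if p in test_patients for t in by_patient[p]]
    return train, test, test_patients


# Toy data (assumed): 10 patients, 2 slides each, 5 tiles per slide.
tiles = [
    (f"p{p}_s{s}_t{t}", f"p{p}")
    for p in range(10)
    for s in range(2)
    for t in range(5)
]
train, test, test_patients = split_by_patient(tiles)
```

Shuffling at the tile level instead would scatter tiles from the same slide across both sets, which is the kind of leakage the findings describe.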
Abstract
Reproducibility of AI models on biomedical data remains a major concern for their acceptance into clinical practice. Initiatives for reproducibility in the development of predictive biomarkers, such as the MAQC Consortium, have already underlined the importance of appropriate Data Analysis Plans (DAPs) to control for different types of bias, including data leakage from the training to the test set. In the context of digital pathology, the leakage typically lurks in weakly designed experiments that do not account for the subjects in their data partitioning schemes. This issue is then exacerbated when fractions or subregions of slides (i.e. "tiles") are considered. Although this aspect is largely recognized by the community, we argue that it is often overlooked. In this study, we assess the impact of data leakage on the performance of machine learning models trained and validated on multiple…
