AI slipping on tiles: data leakage in digital pathology

Nicole Bussola; Alessia Marcolini; Valerio Maggio; Giuseppe Jurman; Cesare Furlanello

arXiv:1909.06539 · q-bio.QM · November 18, 2020 · ICPR Workshops

TL;DR

This paper investigates how data leakage in digital pathology AI models can inflate performance metrics, demonstrates its impact through experiments, and proposes an automated pipeline to prevent leakage, enhancing reproducibility and reliability.

Contribution

It highlights the overlooked issue of data leakage in digital pathology, quantifies its effect on model performance, and introduces histolab, a Python package to create leakage-free deep learning pipelines.

Findings

01

Data leakage can inflate predictive scores by up to 41% when tiles from the same subject appear in both the training and validation sets.

02

Proper data partitioning is crucial to avoid leakage in histology data.

03

The proposed pipeline effectively prevents leakage on public datasets.
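
The partitioning point in the findings above can be illustrated with a minimal sketch of a subject-level split, in which patients rather than individual tiles are assigned to the training or test set. This is not the authors' pipeline and not the histolab API; the function name `patient_level_split`, its parameters, and the toy patient IDs are all hypothetical and use only the Python standard library.

```python
import random
from collections import defaultdict


def patient_level_split(tile_patient_ids, test_fraction=0.2, seed=0):
    """Split tile indices so that no patient contributes to both sets.

    tile_patient_ids: list where entry i is the patient ID of tile i
    (a hypothetical input format, chosen for illustration only).
    """
    # Group tile indices by the patient they were extracted from.
    tiles_by_patient = defaultdict(list)
    for tile_idx, patient in enumerate(tile_patient_ids):
        tiles_by_patient[patient].append(tile_idx)

    # Shuffle patients (not tiles) and hold out a fraction of them.
    patients = sorted(tiles_by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(test_fraction * len(patients)))
    held_out = set(patients[:n_test])

    # Every tile follows its patient into exactly one of the two sets.
    train_idx, test_idx = [], []
    for patient, tiles in tiles_by_patient.items():
        (test_idx if patient in held_out else train_idx).extend(tiles)
    return sorted(train_idx), sorted(test_idx)


# Toy data: 5 patients with 4 tiles each (made-up IDs).
tile_patient_ids = [f"patient_{p}" for p in range(5) for _ in range(4)]
train_idx, test_idx = patient_level_split(tile_patient_ids, test_fraction=0.2)

train_patients = {tile_patient_ids[i] for i in train_idx}
test_patients = {tile_patient_ids[i] for i in test_idx}
# The two patient sets are disjoint by construction.
assert train_patients.isdisjoint(test_patients)
```

A naive split that shuffles tiles directly would, by contrast, very likely place tiles from the same patient on both sides, which is the leakage scenario the paper describes. In practice, scikit-learn's `GroupShuffleSplit` and `GroupKFold` offer the same kind of group-aware partitioning.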

Abstract

Reproducibility of AI models on biomedical data remains a major concern for their acceptance into clinical practice. Initiatives for reproducibility in the development of predictive biomarkers, such as the MAQC Consortium, have already underlined the importance of appropriate Data Analysis Plans (DAPs) to control for different types of bias, including data leakage from the training to the test set. In digital pathology, leakage typically lurks in weakly designed experiments whose data partitioning schemes do not account for subjects. The issue is exacerbated when fractions or subregions of slides (i.e. "tiles") are considered. Although this aspect is largely recognized by the community, we argue that it is often overlooked. In this study, we assess the impact of data leakage on the performance of machine learning models trained and validated on multiple…
