bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R

Selçuk Korkmaz

arXiv:2604.10965 · stat.CO · April 14, 2026

TL;DR

bioLeak is an R package for detecting and mitigating data leakage in biomedical machine learning, with the aim of more reliable model evaluation and interpretation.

Contribution

The package introduces leakage-aware resampling workflows and diagnostic tools for biomedical data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies.

Findings

1. Simulations show that performance estimates vary with the leakage mechanism.

2. A case study shows that leaky and guarded pipelines lead to different conclusions.

3. The software supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and diagnostics.

Abstract

Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how…
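
Two of the ideas named in the abstract, leakage-aware split construction and train-fold-only preprocessing, can be illustrated with a minimal sketch. This is not bioLeak's API (the package is written in R and its function names are not shown here); it is a standard-library Python illustration under stated assumptions. The helpers `grouped_kfold`, `fit_scaler`, and `apply_scaler`, and the toy data, are all hypothetical. The sketch keeps every measurement from one subject in the same fold and estimates scaling parameters from the training fold only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_kfold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    # Assign the largest groups first, each to the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        min(folds, key=len).extend(by_group[g])
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(test)


def fit_scaler(values):
    """Estimate centering and scaling parameters from training values only."""
    mu = mean(values)
    sd = pstdev(values) or 1.0  # guard against a constant feature
    return mu, sd


def apply_scaler(values, mu, sd):
    """Standardize values with parameters estimated elsewhere."""
    return [(v - mu) / sd for v in values]


# Hypothetical toy data: four subjects, two repeated measurements each.
subject = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
x = [1.0, 1.2, 5.0, 5.4, 2.0, 2.2, 9.0, 9.5]

for train, test in grouped_kfold(subject, k=2):
    # Parameters come from the training fold; the test fold is only transformed.
    mu, sd = fit_scaler([x[i] for i in train])
    x_test = apply_scaler([x[i] for i in test], mu, sd)
    shared = {subject[i] for i in train} & {subject[i] for i in test}
    print(sorted({subject[i] for i in test}), "shared subjects:", len(shared))
```

A row-wise split of the same data could place one measurement from a subject in training and the other in testing, and a scaler fitted on all rows would let test values influence the training features. Both are instances of the leakage mechanisms the abstract describes.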
