bioLeak: Leakage-Aware Modeling and Diagnostics for Machine Learning in R
Selçuk Korkmaz

TL;DR
bioLeak is an R package designed to detect and mitigate data leakage in biomedical machine learning, ensuring more reliable model evaluation and interpretation.
Contribution
It introduces leakage-aware resampling workflows and diagnostic tools tailored for complex biomedical data with repeated measures and heterogeneity.
Findings
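Simulations show that apparent model performance varies with the leakage mechanism.
A case study shows that leaky and guarded pipelines can lead to different conclusions.
The software supports binary classification, multiclass classification, regression, and survival analysis, each with task-specific metrics and diagnostics.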
Abstract
Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how…
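The two ideas the abstract names first, group-aware split construction and train-fold-only preprocessing, can be illustrated with a minimal, language-agnostic sketch. It is written in plain Python and is not bioLeak's R API. The function names (`grouped_folds`, `scale_with_train_stats`), the round-robin fold assignment, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_folds(groups, k):
    """Assign whole groups (e.g. subjects) to k folds, round-robin by first appearance."""
    unique = list(dict.fromkeys(groups))  # de-duplicate while preserving order
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = defaultdict(list)
    for row, g in enumerate(groups):
        folds[fold_of_group[g]].append(row)
    return [folds[i] for i in range(k)]


def scale_with_train_stats(values, train_idx, test_idx):
    """Standardize using the mean and SD estimated on training rows only."""
    train = [values[i] for i in train_idx]
    mu = mean(train)
    sd = pstdev(train) or 1.0  # fall back to 1.0 if the training fold has zero variance
    scaled_train = [(values[i] - mu) / sd for i in train_idx]
    scaled_test = [(values[i] - mu) / sd for i in test_idx]
    return scaled_train, scaled_test


# Hypothetical toy data: three subjects with two repeated measurements each.
subject = ["s1", "s1", "s2", "s2", "s3", "s3"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

folds = grouped_folds(subject, k=3)
for test_idx in folds:
    train_idx = [i for i in range(len(x)) if i not in test_idx]
    # No subject contributes rows to both sides of the split.
    assert not {subject[i] for i in train_idx} & {subject[i] for i in test_idx}
    # Scaling parameters come from the training fold and are reused on the test fold.
    x_train, x_test = scale_with_train_stats(x, train_idx, test_idx)
```

A row-wise split would put measurements from the same subject in both training and test folds. Scaling with statistics from the full data set would let test-fold values influence the preprocessing. The sketch avoids both by splitting on the grouping variable and estimating the scaler inside each training fold.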
