Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$
Ole-Christian Galbo Engstrøm, Martin Holm Jensen

TL;DR
This paper introduces fast, correct algorithms for partition-based cross-validation in machine learning models involving matrix products, supporting various preprocessing options without data leakage, and with complexity independent of the number of folds.
Contribution
The authors develop the first efficient, fold-independent cross-validation algorithms for all 16 centering/scaling combinations in models requiring $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$, ensuring no data leakage.
Findings
Algorithms support all centering/scaling combinations
Running time is independent of the number of folds
Preprocessing adds only a manageable constant overhead
Abstract
We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $\mathbf{X}$ and $\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time…
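To make the idea concrete, the sketch below illustrates one way the training-set products for each fold could be obtained from products computed once on the full data set, with column-wise centering based on training-set means only (so no validation statistics leak in). It is a minimal illustration, not the authors' implementation: the function name `training_products` and its parameters are hypothetical, it uses NumPy, and it covers centering only, not scaling.

```python
import numpy as np

def training_products(X, Y, folds):
    """Yield (fold, XtX, XtY) for the centered training set of each fold.

    Hypothetical helper for illustration. `folds` is a 1-D array assigning
    each row of X and Y to exactly one fold.
    """
    n = X.shape[0]
    # Computed once on the full data set.
    XtX_total = X.T @ X
    XtY_total = X.T @ Y
    sum_X = X.sum(axis=0)
    sum_Y = Y.sum(axis=0)

    for fold in np.unique(folds):
        val = folds == fold
        Xv, Yv = X[val], Y[val]
        n_train = n - Xv.shape[0]

        # Uncentered training products: total minus validation contribution.
        XtX = XtX_total - Xv.T @ Xv
        XtY = XtY_total - Xv.T @ Yv

        # Training-set means, again obtained by subtraction.
        mean_X = (sum_X - Xv.sum(axis=0)) / n_train
        mean_Y = (sum_Y - Yv.sum(axis=0)) / n_train

        # Centering identity: Xc^T Xc = X^T X - n * mean_X mean_X^T,
        # and likewise Xc^T Yc = X^T Y - n * mean_X mean_Y^T.
        XtX_c = XtX - n_train * np.outer(mean_X, mean_X)
        XtY_c = XtY - n_train * np.outer(mean_X, mean_Y)
        yield fold, XtX_c, XtY_c


# Usage: compare against the direct computation on each training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
Y = rng.normal(size=(12, 2))
folds = np.arange(12) % 4
for fold, XtX_c, XtY_c in training_products(X, Y, folds):
    train = folds != fold
    Xc = X[train] - X[train].mean(axis=0)
    Yc = Y[train] - Y[train].mean(axis=0)
    assert np.allclose(XtX_c, Xc.T @ Xc)
    assert np.allclose(XtY_c, Xc.T @ Yc)
```

Because every sample belongs to exactly one validation fold, the validation products formed across all folds touch each row of the data once in total, rather than once per fold as in the direct approach.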
Taxonomy
Topics: Blind Source Separation Techniques · Machine Learning and Data Classification · Statistical and numerical algorithms
