Fast Partition-Based Cross-Validation With Centering and Scaling for $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$

Ole-Christian Galbo Engstrøm; Martin Holm Jensen

arXiv:2401.13185·cs.LG·September 29, 2025·1 cites

TL;DR

This paper introduces fast, correct algorithms for partition-based cross-validation in machine learning models involving matrix products, supporting various preprocessing options without data leakage, and with complexity independent of the number of folds.

Contribution

The authors develop the first efficient, fold-independent cross-validation algorithms for all 16 centering/scaling combinations in models requiring $\mathbf{X}^\mathbf{T}\mathbf{X}$ and $\mathbf{X}^\mathbf{T}\mathbf{Y}$, ensuring no data leakage.

Findings

01 Algorithms support all centering/scaling combinations

02 Running time is independent of the number of folds

03 Preprocessing adds only a manageable constant overhead

Abstract

We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require the matrix products $X^{T} X$ and $X^{T} Y$. Our algorithms have applications in model selection for, among others, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). Our algorithms support all combinations of column-wise centering and scaling of $X$ and $Y$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time…
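The general idea behind fold-independent cost can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm or their accompanying implementation: it covers only column-wise centering (no scaling), and the function name `train_products_per_fold` and the `folds` label array are hypothetical. The sketch computes $X^{T} X$ and $X^{T} Y$ once for the full data set, subtracts each validation fold's contribution to obtain the training-set products, and then applies a rank-one correction using means computed from the training rows only, so no validation data leaks into the centering.

```python
import numpy as np

def train_products_per_fold(X, Y, folds):
    """Yield (fold_id, XtX, XtY) for each training set, centered on training means.

    Hypothetical helper for illustration; X is (N, K), Y is (N, M),
    and folds is a length-N array of fold labels.
    """
    n = X.shape[0]
    # Computed once for the whole data set.
    XtX_all = X.T @ X
    XtY_all = X.T @ Y
    sum_x_all = X.sum(axis=0)
    sum_y_all = Y.sum(axis=0)
    for fold_id in np.unique(folds):
        val = folds == fold_id
        Xv, Yv = X[val], Y[val]
        n_train = n - Xv.shape[0]
        # Training products = full products minus the validation fold's part.
        XtX = XtX_all - Xv.T @ Xv
        XtY = XtY_all - Xv.T @ Yv
        # Training-set column means, derived without touching training rows again.
        mean_x = (sum_x_all - Xv.sum(axis=0)) / n_train
        mean_y = (sum_y_all - Yv.sum(axis=0)) / n_train
        # Centering X and Y on these means is a rank-one correction.
        XtX_c = XtX - n_train * np.outer(mean_x, mean_x)
        XtY_c = XtY - n_train * np.outer(mean_x, mean_y)
        yield fold_id, XtX_c, XtY_c

# Small synthetic example (assumed data, 4 folds of 3 samples each).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
Y = rng.normal(size=(12, 2))
folds = np.repeat(np.arange(4), 3)
results = list(train_products_per_fold(X, Y, folds))
```

Because every sample belongs to exactly one validation fold, the subtractions across all folds touch each row once, which is why the matrix-product work in this sketch does not grow with the number of folds.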

Taxonomy

Topics: Blind Source Separation Techniques · Machine Learning and Data Classification · Statistical and numerical algorithms