Cross-Validation for Correlated Data

Assaf Rabinowicz; Saharon Rosset

arXiv:1904.02438·stat.ME·August 10, 2021

Cross-Validation for Correlated Data

Assaf Rabinowicz, Saharon Rosset

PDF

TL;DR

This paper examines the limitations of standard cross-validation with correlated data and proposes a bias-corrected estimator, $CV_c$, to improve model evaluation and selection in such settings.

Contribution

It introduces a criterion for when standard CV is appropriate for correlated data and develops a bias correction method, $CV_c$, for cases where it is not.

Findings

01

Standard CV can be biased with correlated data.

02

The proposed $CV_c$ estimator reduces bias in prediction error estimation.

03

Numerical experiments show improved model evaluation and selection using $CV_c$.

Abstract

K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be taken. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-i.i.d. data. This paper analyzes CV for correlated data. We present a criterion for suitability of standard CV in presence of correlations. When this criterion does not hold, we introduce a bias corrected cross-validation estimator which we term $C V_{c},$ that yields an unbiased estimate of prediction error in many settings where standard CV is invalid. We also demonstrate our results numerically, and find that introducing our correction substantially improves both, model evaluation and model selection in simulations and real data studies.

Equations114

y_{i, j} = ϕ_{i, j}^{t} β + b_{i} + ϵ_{i, j}, i \in {1, ..., I}, j \in {1, ..., n_{i}},

y_{i, j} = ϕ_{i, j}^{t} β + b_{i} + ϵ_{i, j}, i \in {1, ..., I}, j \in {1, ..., n_{i}},

y = Φ β + Z s + ϵ,

y = Φ β + Z s + ϵ,

CV=\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in k^{th}\,fold}L\big{(}y_{i},\hat{y}(\boldsymbol{x}_{i};T_{-k})\big{)},

CV=\frac{1}{n}\sum_{k=1}^{K}\sum_{i\in k^{th}\,fold}L\big{(}y_{i},\hat{y}(\boldsymbol{x}_{i};T_{-k})\big{)},

\displaystyle generalization\;error=\mathbb{E}_{T_{te},T_{tr}}\frac{1}{n_{te}}\sum_{i=1}^{n_{te}}L\big{(}y_{te,i},\hat{y}(\boldsymbol{x}_{te,i};T_{tr})\big{)},

\displaystyle generalization\;error=\mathbb{E}_{T_{te},T_{tr}}\frac{1}{n_{te}}\sum_{i=1}^{n_{te}}L\big{(}y_{te,i},\hat{y}(\boldsymbol{x}_{te,i};T_{tr})\big{)},

\displaystyle\{\boldsymbol{y}_{k},X_{k}\}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\{\boldsymbol{y}_{-k},X_{-k}\},\;\forall k\in\{1,...,K\}

\displaystyle\{\boldsymbol{y}_{k},X_{k}\}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\{\boldsymbol{y}_{-k},X_{-k}\},\;\forall k\in\{1,...,K\}

\{\boldsymbol{y}_{te},X_{te}\}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\{\boldsymbol{y}_{tr},X_{tr}\},

\{\boldsymbol{y}_{te},X_{te}\}\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\{\boldsymbol{y}_{tr},X_{tr}\},

\displaystyle w_{cv}=\mathbb{E}_{T_{te},T_{tr}}\big{(}y_{te}-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}-\frac{1}{n}\mathbb{E}_{T}\|\boldsymbol{y}-\hat{\boldsymbol{y}}_{cv}(T)\|^{2},

\displaystyle w_{cv}=\mathbb{E}_{T_{te},T_{tr}}\big{(}y_{te}-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}-\frac{1}{n}\mathbb{E}_{T}\|\boldsymbol{y}-\hat{\boldsymbol{y}}_{cv}(T)\|^{2},

C V_{c} = C V + w_{c v} .

C V_{c} = C V + w_{c v} .

\displaystyle\mathbb{E}\big{(}y_{te}-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}=

\displaystyle\mathbb{E}\big{(}y_{te}-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}=

\displaystyle+\mathbb{E}\big{(}\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}

\displaystyle+2\mathbb{E}\big{(}y_{te}-\mathbb{E}\,y_{te}\big{)}\big{(}\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)},

E ∥ y - \hat{y}_{c v} (T) ∥^{2} =

\displaystyle+2\mathbb{E}\big{(}\boldsymbol{y}-\mathbb{E}\,\boldsymbol{y}\big{)}^{t}\big{(}\mathbb{E}\,\hat{\boldsymbol{y}}_{cv}(T)-\hat{\boldsymbol{y}}_{cv}(T)\big{)},

w_{c v} =

w_{c v} =

\displaystyle\mathbb{E}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr})-\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}-\frac{1}{n}\mathbb{E}\|\hat{\boldsymbol{y}}_{cv}(T)-\mathbb{E}\,\hat{\boldsymbol{y}}_{cv}(T)\|^{2}

\displaystyle-2\mathbb{E}\,\mathrm{Cov}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr}),y_{te}\big{)}+\frac{2}{n}\mathbb{E}\,\mathrm{tr}\big{[}\mathrm{Cov}\big{(}\hat{\boldsymbol{y}}_{cv}(T),\boldsymbol{y}\big{)}\big{]},

\big{(}\mathbb{E}\,y_{te}-\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}=\frac{1}{n}\|\mathbb{E}\,\boldsymbol{y}-\mathbb{E}\,\hat{\boldsymbol{y}}_{cv}(T)\|^{2}.

\big{(}\mathbb{E}\,y_{te}-\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}=\frac{1}{n}\|\mathbb{E}\,\boldsymbol{y}-\mathbb{E}\,\hat{\boldsymbol{y}}_{cv}(T)\|^{2}.

\displaystyle\mathbb{E}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr})-\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}

\displaystyle\mathbb{E}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr})-\mathbb{E}\,\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2}

\displaystyle 2\mathbb{E}\,\mathrm{Cov}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr}),y_{te}\big{)}

P_{y_{k} ∣ x_{k}, T_{- k}} = P_{y_{t e} ∣ x_{t e}, T_{t r}} .

P_{y_{k} ∣ x_{k}, T_{- k}} = P_{y_{t e} ∣ x_{t e}, T_{t r}} .

P_{y_{k} ∣ x_{k}, T_{- k}} = P_{y_{t e} ∣ x_{t e}, T_{- k}} .

P_{y_{k} ∣ x_{k}, T_{- k}} = P_{y_{t e} ∣ x_{t e}, T_{- k}} .

y =

y =

y_{t r} =

y_{t e} =

X =

X =

X_{t r} =

x_{t e} =

y =

y =

y_{t r} =

y_{t e} =

\hat{y}_{c v} (T) = H_{c v} y,

\hat{y}_{c v} (T) = H_{c v} y,

H_{cv}=\left[\begin{array}[]{cccc}0&h_{1,2}&...&h_{1,n}\\ h_{2,1}&0&&h_{2,n}\\ ...\\ h_{n,1}&h_{n,2}&...&0\end{array}\right],

H_{cv}=\left[\begin{array}[]{cccc}0&h_{1,2}&...&h_{1,n}\\ h_{2,1}&0&&h_{2,n}\\ ...\\ h_{n,1}&h_{n,2}&...&0\end{array}\right],

h_{1} = [h_{1, 2}, ..., h_{1, n}] .

h_{1} = [h_{1, 2}, ..., h_{1, n}] .

w_{cv}=\frac{2}{n}\mathbb{E}\big{[}\mathrm{tr}\Big{(}H_{cv}\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}-n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}\big{]}.

w_{cv}=\frac{2}{n}\mathbb{E}\big{[}\mathrm{tr}\Big{(}H_{cv}\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}-n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}\big{]}.

-2\mathbb{E}\,\mathrm{Cov}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr}),y_{te}\big{)}+\frac{2}{n}\mathbb{E}\,\mathrm{tr}\big{[}\mathrm{Cov}\big{(}\hat{\boldsymbol{y}}_{cv}(T),\boldsymbol{y}\big{)}\big{]}=w_{cv}.

-2\mathbb{E}\,\mathrm{Cov}\big{(}\hat{y}(\boldsymbol{x}_{te};T_{tr}),y_{te}\big{)}+\frac{2}{n}\mathbb{E}\,\mathrm{tr}\big{[}\mathrm{Cov}\big{(}\hat{\boldsymbol{y}}_{cv}(T),\boldsymbol{y}\big{)}\big{]}=w_{cv}.

\displaystyle\big{(}\mathbb{E}\,y_{te}-\mathbb{E}\,\boldsymbol{h}_{te}\boldsymbol{y}_{tr}\big{)}^{2}

\displaystyle\big{(}\mathbb{E}\,y_{te}-\mathbb{E}\,\boldsymbol{h}_{te}\boldsymbol{y}_{tr}\big{)}^{2}

\displaystyle\mathbb{E}\big{(}\boldsymbol{h}_{te}\boldsymbol{y}_{tr}-\mathbb{E}\,\boldsymbol{h}_{te}\boldsymbol{y}_{tr}\big{)}^{2}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Cross Validation for Correlated Data

Assaf Rabinowicz

Department of Statistics, Tel-Aviv University, Tel-Aviv, Israel, 69978

Saharon Rosset

Department of Statistics, Tel-Aviv University, Tel-Aviv, Israel, 69978

The authors gratefully acknowledge Israel Science Foundation, grant 1804/16

(April 05, 2020)

Cross-Validation for Correlated Data

Assaf Rabinowicz

Department of Statistics, Tel-Aviv University, Tel-Aviv, Israel, 69978

Saharon Rosset

Department of Statistics, Tel-Aviv University, Tel-Aviv, Israel, 69978

The authors gratefully acknowledge Israel Science Foundation, grant 1804/16

(April 05, 2020)

Abstract

K-fold cross-validation (CV) with squared error loss is widely used for evaluating predictive models, especially when strong distributional assumptions cannot be taken. However, CV with squared error loss is not free from distributional assumptions, in particular in cases involving non-i.i.d. data. This paper analyzes CV for correlated data. We present a criterion for suitability of standard CV in presence of correlations. When this criterion does not hold, we introduce a bias corrected cross-validation estimator which we term $CV_{c},$ that yields an unbiased estimate of prediction error in many settings where standard CV is invalid. We also demonstrate our results numerically, and find that introducing our correction substantially improves both, model evaluation and model selection in simulations and real data studies.

Keywords: Prediction Error Estimation; Model Selection; Dependent Data; Linear Mixed Model; Gaussian Process Regression

1 INTRODUCTION

Datasets with correlation structures are common in modern statistical applications in various fields, such as geostatistics (Goovaerts 1999), genetics (Maddison 1990) and ecology (Roberts et al. 2017). Different modeling methods address the correlation structure differently. Some modeling methods, such as Gaussian process regression (Rasmussen and Williams 2006, GPR) and generalized least squares (Hansen 2007, GLS), utilize explicitly the correlation structure for achieving better prediction accuracy. Other predictive models, like random forest (Breiman 2001, RF), gradient boosting machines (Friedman 2002, GBM) and other machine learning models, do not consider explicitly the correlation structure but are still potentially able to utilize the correlation implicitly. The analysis in this paper mainly focuses on correlation that appears due to latent objects, such as random effects and random fields as appear in generalized linear mixed models (Verbeke 1997, GLMM) and generalized Gaussian process regression (Rasmussen and Williams 2006, GGPR) in clustered, temporal and spatial datasets. A simple example that demonstrates the way that latent variable realizations affect the correlation structure is linear mixed models (Verbeke 1997, LMM) with random intercept for clustered data:

[TABLE]

where $y_{i,j}\in\mathbb{R}$ is the $j^{th}$ observation for cluster $i,$ $\boldsymbol{\phi}_{i,j}\in\mathbb{R}^{p}$ are the observed fixed effects covariates, $\boldsymbol{\beta}\in\mathbb{R}^{p}$ is the fixed effects coefficients vector, $b_{i}\sim N(0,\sigma_{b}^{2})$ are independent random effects and $\epsilon_{i,j}\sim N(0,\sigma_{\epsilon}^{2})$ are the i.i.d. errors. Since all the observations in cluster $i$ share the same random effect realization, they are correlated. For more information see Verbeke (1997).

When it comes to prediction, the question of whether there is correlation between the observations from the training set and the prediction set — the sample that is used for the model’s parameter estimation, and the set of points whose response is predicted based on the trained model, respectively — plays an important role. For example, in Eq. (1), if the training and prediction sets are sampled from the same clusters, then once the random effect realizations are estimated in the model training, they can be utilized for better prediction accuracy of the dependent variable in the prediction set. Under some conditions, a predictor that uses the estimated random effect realizations is the best linear unbiased predictor (BLUP), for more information see Harville et al. (1976). Another scenario is when the training and prediction sets are sampled from different clusters. In this scenario, observations of the prediction set are not correlated with observations of the training set, and therefore estimating the random effect realizations of the training set cannot be utilized for achieving better prediction accuracy. Of course, there are other correlation settings, e.g., when the random effect realizations of the training and the prediction sets are not the same but correlated. In addition to the distributional settings that are covered by GLMM and GGPR, explicit and implicit utilization of the correlation between the training and prediction sets is common in other distributional settings in various applications, including applications involving spatial datasets (Ward and Gleditsch 2018), longitudinal datasets (Hand 2017) and datasets with hierarchical clustering structure (Raudenbush and Bryk 2002).

The correlation structure of the training and prediction sets, and in particular the correlation between the training and prediction sets, can affect the model’s prediction error and therefore should be carefully addressed when estimating the prediction error. Since many model selection procedures are based on prediction error estimation, ignoring the correlation setting may also affect model selection decision.

Various prediction error measures have been used and analyzed under different correlation settings, e.g., $AIC$ -type (Vaida and Blanchard 2005) and $C_{p}$ -type (Hodges and Sargent 2001). This paper focuses on K-fold cross-validation prediction error estimator (Stone 1974, CV) which is the most widely used method for estimating prediction error (Hastie et al. 2009). We introduce a new perspective on CV and propose an applicable framework for analyzing how CV is affected by the correlation structure in various distributional settings. In addition, a bias corrected cross-validation measure, $CV_{c},$ is introduced. $CV_{c}$ is suitable for many scenarios where CV is biased due to correlations.

Section 2 presents the setting of the problem, the theoretical results of the proposed approach, and comparison with other methods. Section 3 presents numerical analyses that support the theoretical results of Section 2.

2 CV FOR CORRELATED DATA

2.1 K-fold Estimator and Generalization Error

Let $\{\boldsymbol{x}_{i}\}_{i=1}^{n},\;\boldsymbol{x}_{i}\in\mathbb{R}^{p},$ be an independent and identically distributed (i.i.d.) sample form the probability distribution function $P_{\boldsymbol{x}}$ . Let $\{y_{i}\}_{i=1}^{n},\;y_{i}\in\mathbb{R},$ be a sample that is drawn independently from the distribution $P_{y|\boldsymbol{x}=\boldsymbol{x}_{i},\boldsymbol{s}=\boldsymbol{s}_{0}},$ where $\boldsymbol{s}_{0}\in\mathbb{R}^{q}$ is the latent variable realization that induces correlation structure between $y_{i}\,^{\prime}s.$

Denote $T=\{y_{i},\boldsymbol{x}_{i}\}_{i=1}^{n}$ as the training sample for fitting a predictive model. Also denote $X=\left[\begin{array}[]{c}-\;\boldsymbol{x}_{1}^{t}-\\ \vdots\\ -\;\boldsymbol{x}_{n}^{t}-\end{array}\right]$ and $\boldsymbol{y}=\left[\begin{array}[]{c}y_{1}\\ \vdots\\ y_{n}\end{array}\right].$

One example of this setting is linear mixed models (LMM):

[TABLE]

where $\Phi\in\mathbb{R}^{n\times(p-q)}$ contains the fixed effects covariates, $Z\in\mathbb{R}^{n\times q}$ contains the random effects covariates, $\boldsymbol{s}\in\mathbb{R}^{q}$ is a normally distributed random effects vector with a general covariance matrix and $\boldsymbol{\epsilon}\in\mathbb{R}^{n}$ is the normally distributed error term with a general covariance matrix. In this setting $X=\{\Phi,Z\}.$ Eq. (1) is a special case of this setting. Extensions of this model are generalized linear mixed models (Breslow and Clayton 1993, GLMM), where $\boldsymbol{y}=g^{-1}(\Phi\boldsymbol{\beta}+Z\boldsymbol{s}+\boldsymbol{\epsilon})$ and $g$ is the link function, as well as hierarchical generalized linear models (Lee and Nelder 1996, HGLM) where $\boldsymbol{s}$ and $\boldsymbol{\epsilon}$ do not necessarily follow the normal distribution. Other examples of this setting that are commonly analyzed and represented using graphical probabilistic models tools are hidden Markov models and mixture models (Jordan 2004).

Once a model is fitted, it is natural to evaluate the prediction ability of the model. In CV prediction error estimator, $T$ is randomly partitioned into K folds, $T_{1}=\{y_{i},\boldsymbol{x}_{i}\}_{i=1}^{n_{1}},$ $T_{2}=\{y_{i},\boldsymbol{x}_{i}\}_{i=(n_{1}+1)}^{n_{2}},...,T_{K}=\{y_{i},\boldsymbol{x}_{i}\}_{i=(n_{K-1}+1)}^{n_{K}},$ 111We assume the observations are randomly ordered, so the folds are exchangeable. where $n_{k}-n_{k-1}$ is the sample size in fold $k$ and $n_{K}=n.$ For each $k\in\{1,...,K\},$ the model is trained on the entire data except the $k^{th}$ fold, denoted by $T_{-k}=\bigcup_{j\neq k}T_{j}=\{\boldsymbol{y}_{-k},X_{-k}\}.$ The prediction error of the trained model is then measured on the holdout fold, $T_{k}=\{\boldsymbol{y}_{k},X_{k}\},$ with respect to some loss function $L(\cdot\,,\cdot):\;\mathbb{R}\times\mathbb{R}\to\mathbb{R}.$ Thus, each fold does not train the model used for predicting its outcome. The CV prediction error estimator is calculated by averaging out the estimated prediction error across all the folds, i.e.,

[TABLE]

where $\hat{y}(\boldsymbol{x}_{i};T_{-k})$ is the predictor of $y_{i},$ constructed by training with $T_{-K}$ and predicting on $\boldsymbol{x}_{i}.$ A special case of CV is leave one out CV (Stone 1974, LOO), which is defined by setting $K=n,$ i.e., each observation defines a fold and therefore the model is trained on $n-1$ observations and its prediction error is evaluated on a single observation. Under some conditions, LOO is superior to other CV variants in terms of minimizing asymptotic bias and instability (for more information see Burman (1989); Arlot et al. (2010)), but can be computationally prohibitive in some settings. A comprehensive comparison of CV variants in terms of prediction error estimation and models selection can be found in Zhang and Yang (2015).

In case of i.i.d. sampling, CV is considered to be an estimator of the generalization error, which is defined here in a wide sense that also covers settings with correlated data:

[TABLE]

where

•

$T_{tr}=\{\boldsymbol{y}_{tr},X_{tr}\}=\{y_{tr,i},\boldsymbol{x}_{tr,i}\}_{i=1}^{n_{tr}}$

–

$\boldsymbol{x}_{tr,i}$ is an i.i.d. sample from $P_{\boldsymbol{x}}$

–

$y_{tr,i}$ is sampled from $P_{y|\boldsymbol{x}=\boldsymbol{x}_{tr,i},\boldsymbol{s}=\boldsymbol{s}_{0}}$

•

$T_{te}=\{\boldsymbol{y}_{te},X_{te}\}=\{y_{te,i},\boldsymbol{x}_{te,i}\}_{i=1}^{n_{te}}$

–

$\boldsymbol{x}_{te,i}$ is an i.i.d. sample from $P_{\boldsymbol{x}}$

–

$y_{te,i}$ is sampled from $P_{y|\boldsymbol{x}=\boldsymbol{x}_{te,i},\boldsymbol{s}=\boldsymbol{s}_{te}}$

The reason that CV is commonly considered to be an estimator of the generalization error is the random mechanism that is embedded in the CV procedure. In this perspective, $T_{k}$ and $T_{-k}$ are equivalent to $T_{te}$ and $T_{tr},$ respectively. Then, averaging the prediction error, $\sum_{i\in k^{th}\,fold}L\big{(}y_{i},\hat{y}(\boldsymbol{x}_{i};T_{-k})\big{)}/\big{(}n_{k}-n_{k-1}\big{)},$ over the different folds, estimates Eq. (2).

It is important to emphasize that unlike $T,$ which is the available dataset for modeling, $\{T_{tr},T_{te}\}$ are used in the paper for demonstrating different prediction goals, rather than actual datasets.

Remark 2.1.

Note, it is typically assumed that $T_{tr}$ is distributed as $T$ and therefore of size $n.$ In this case, the size of $T_{-k}$ and $T_{tr}$ is obviously different, and additional bias in CV evaluation is introduced. This is typically ignored, especially when considering LOO and assuming training with $n-1$ or $n$ observations carries little difference. In what follows we also ignore this and implicitly assume that $T_{tr}$ is of the same size as $T_{-k}$ for all $k$ .

When

[TABLE]

and

[TABLE]

it is clear that CV is an unbiased estimator of the generalization error, however a careful analysis is required for the case when the folds are dependent.

Next we investigate how deviating from the condition in Eq. (3) contributes a bias to CV with respect to the generalization error. Based on it, a bias corrected CV estimator, $CV_{c},$ will be presented.

In what follows we limit the discussion to squared error loss function. Also note, from now on LOO setting will be assumed, and consequently $n_{te}=1,n_{tr}=n-1,$ however the results are valid to other CV partitioning settings. In particular, it can easily be seen that the size of the test set, $n_{te},$ has no bearing on generalization error definition.

2.2 A General Formulation of CV Bias

Let

[TABLE]

where $\hat{\boldsymbol{y}}_{cv}(T)=\big{(}\hat{y}(\boldsymbol{x}_{1};T_{-1}),...,\hat{y}(\boldsymbol{x}_{n};T_{-n})\big{)}^{t}$ is the CV predictor of $\boldsymbol{y}.$

An unbiased estimator of the generalization error is:

[TABLE]

Before analyzing different correlation settings, we derive a more explicit expression for $w_{cv}.$ For simplicity, subscript notations in the expectation operator are frequently omitted. Therefore, unless a specific object is specified, $\mathbb{E}$ averages all the random variables that it operates on.

Since

[TABLE]

and $y_{i}$ and $y_{te}$ have the same marginal distribution, which gives $\mathbb{E}\|\boldsymbol{y}-\mathbb{E}\,\boldsymbol{y}\|^{2}/n-\mathbb{E}\big{(}y_{te}-\mathbb{E}\,y_{te}\big{)}^{2}=0,$ then:

[TABLE]

where $\mathrm{tr}$ is the trace operator and $\mathrm{Cov}$ is the covariance operator which contains conditional expectation of the dependent variables $\boldsymbol{y},\;y_{te}$ and $\boldsymbol{y}_{tr}$ given their covariates, $X,\;\boldsymbol{x}_{te}$ and $X_{tr}$ e.g., $\mathbb{E}\,\mathrm{tr}\big{[}\mathrm{Cov}\big{(}\hat{\boldsymbol{y}}_{cv}(T),\boldsymbol{y}\big{)}\big{]}\coloneqq\mathbb{E}_{X}\mathrm{tr}\big{[}\mathrm{Cov}\big{(}\hat{\boldsymbol{y}}_{cv}(T),\boldsymbol{y}|X\big{)}\big{]}.$

The first two lines in Eq. (5) are the differences between the bias and the variance of $\hat{\boldsymbol{y}}_{cv}(T)$ and $\hat{y}(\boldsymbol{x}_{te};T_{tr}),$ where the expectation is taken also over the covariates. The third line relates to the covariances between the response and its predictor in each scheme — CV prediction error and generalization error.

2.3 Criterion for CV Unbiasedness

Let $P_{T_{te},T_{tr}}$ and $P_{T_{k},T_{-k}}$ be the joint distributions of $\{T_{te},T_{tr}\}$ and $\{T_{k},T_{-k}\},$ respectively. Theorem 2.1 describes a simple generic condition when no correction is required for CV.

Theorem 2.1.

If $P_{T_{te},T_{tr}}=P_{T_{k},T_{-k}}\;\forall k\in\{1,..,n\},$ then $w_{cv}=0.$

Proof.

Since $\{T_{k},T_{-k}\}$ were drawn from the same distribution as $\{T_{te},T_{tr}\},$ then an expectation over any transformation of them is equal, in particular:

[TABLE]

Similarly

[TABLE]

∎

Theorem 2.1 states a very basic and intuitive condition of CV unbiasedness — When the CV partitioning preserves the distributional relation between the prediction set to the training set, then CV is unbiased.

Let $P_{y_{k}|\boldsymbol{x}_{k},T_{-k}},\;P_{y_{te}|\boldsymbol{x}_{te},T_{tr}}$ and $P_{y_{te}|\boldsymbol{x}_{te},T_{-k}}$ denote the respective conditional distribution. The condition in Theorem 2.1 can be compacted to the following one:

[TABLE]

Moreover, since $T_{tr}$ is assumed to be distributed as $T_{-k},$ then this condition can be rewritten as follows:

[TABLE]

Of course, when $w_{cv}=0$ the CV is unbiased and therefore suitable. Based on Theorem 2.1, it is important to stress that CV is suitable not only for the case that $\{y_{k},\boldsymbol{x}_{k}\}_{k=1}^{n}$ are independent, neither only for the case when $\{y_{k},\boldsymbol{x}_{k}\}_{k=1}^{n}$ are exchangeable, as is commonly mistakenly referred in the literature (Anderson et al. 2018; Roberts et al. 2017). The biasedness of CV only relates to the question whether $T_{-k}$ contributes more information for predicting $y_{k}$ than $T_{tr}$ contributes for predicting $y_{te}.$

We can demonstrate the use of Theorem 2.1 for a simple application — using LMM for predicting new observations from the same clusters that appear in the training set. In the case $\boldsymbol{s}_{te}=\boldsymbol{s}_{0},$ i.e.,

[TABLE]

and

[TABLE]

$\boldsymbol{s}_{0}\in\mathbb{R}^{q}$ is the random effect realization vector, where each entry is a random effect realization for a different cluster and $\boldsymbol{\epsilon}\in\mathbb{R}^{n},\;\boldsymbol{\epsilon}_{tr}\in\mathbb{R}^{(n-1)},\;\epsilon_{te}\in\mathbb{R}$ are i.i.d. normal errors terms.

As was mentioned previously, the observations in $X,\;X_{tr}$ and $\boldsymbol{x}_{te}$ are i.i.d.. Also, in this example $\boldsymbol{y},\;\boldsymbol{y}_{tr}$ and $y_{te}$ were drawn given the same latent variable realization, $\boldsymbol{s}_{0},$ therefore Theorem 2.1’s condition — $P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}=P_{y_{te}|\boldsymbol{x}_{te},T_{-k}}$ — is satisfied and $w_{cv}=0.$ This use case of predicting new points from the same clusters that were used in the training data is common, for example see Gelman (2006).

The principle of CV suitability for the setting in Eq. (7) is discussed in the LMM literature (Fang 2011; Little et al. 2017), however we did not find any general mathematical formalization of it. Commonly, CV is avoided in applications involving correlated data based on the wrong perception that CV is always unsuitable for these cases (Anderson et al. 2018; Roberts et al. 2017). It is also important to stress that since the condition in Theorem 2.1 only relates to the distributional relation between $\{T_{te},T_{tr}\}$ and $\{T_{k},T_{-k}\}$ rather than specifying a distribution, Theorem 2.1 can be implemented in applications where the distributional settings are not fully specified, as is common when implementing machine learning algorithms.

2.4 CV Correction

Now, consider the setting where Theorem 2.1’s condition is not satisfied, i.e., when $P_{T_{te},T_{tr}}\neq P_{T_{k},T_{-k}}.$ A simple scenario for this setting can again be taken from LMM for clustered data, where $T_{te}$ contains a new latent random effects realization i.e.,

[TABLE]

where $\boldsymbol{s}_{te}\neq\boldsymbol{s}_{0}.$ Other relevant components are defined in the same way as in Eq. (7). In this scenario the correlation between $\boldsymbol{y}_{tr}$ and $y_{te}$ is different than the correlation between $\boldsymbol{y}_{-k}$ and $y_{k},$ and a further analysis of $w_{cv}$ is required. This can occur for example when the random effects are intercepts for clusters (e.g., cities), and the data was collected at one point of time, but the prediction task is performed on the same clusters at another point of time.

Here, first we will find an estimator of $w_{cv}$ when $\hat{\boldsymbol{y}}_{cv}(T)$ is linear in $\boldsymbol{y},$ then $w_{cv}$ for nonlinear predictors will be discussed as well. However, before that, let us demonstrate how $\hat{\boldsymbol{y}}_{cv}(T)$ can be formalized linearly in $\boldsymbol{y}$ for some models.

Definition 2.1.

$\hat{\boldsymbol{y}}_{cv}(T)$ * is linear in $\boldsymbol{y}$ if:*

[TABLE]

where $H_{cv}\in\mathbb{R}^{n\times n}$ does not contain $\boldsymbol{y}$ and is constructed as follows:

[TABLE]

$h_{k,k^{\prime}}\in\mathbb{R}\;\forall k,k^{\prime}\in\{1,...,n\}.$ **

The cross validation’s principle that $y_{k}$ is not involved in predicting itself is reflected by having zeros in the diagonal, therefore $\hat{y}(\boldsymbol{x}_{k},T_{-k})=\boldsymbol{h}_{k}\boldsymbol{y}_{-k},$ where $\boldsymbol{h}_{k}\in\mathbb{R}^{1\times n-1}$ is the $k^{th}$ row of $H_{cv}$ without the diagonal element, e.g.,

[TABLE]

Of course, $H_{cv}$ can be defined for other K-fold CV settings by adjusting the dimension of the blocks with the zeros on the diagonal (in LOO the dimension of each block is $1\times 1,$ however for general K-fold CV is $n/K\times n/K$ ).

Examples for linear models are ordinary least squares (OLS), generalized least squares (GLS), ridge regression, smoothing splines, LMM, Gaussian process regression (GPR) and kernel regression.

Definition 2.2.

Let $\boldsymbol{h}_{te}\in\mathbb{R}^{1\times n-1}$ be the hat vector of $y_{te},$ constructed from $\{X_{tr},\mathrm{Cov}(y_{te},\boldsymbol{y}_{tr})\}$ in the same way as $\boldsymbol{h}_{k}$ is constructed from $\{X_{-k},\mathrm{Cov}(y_{k},\boldsymbol{y}_{-k})\}.$

By definition 2.2, $\boldsymbol{h}_{te}$ is the equivalent of $\boldsymbol{h}_{k}$ for the set $\{T_{te},T_{tr}\}$ and they are distributed the same.

Theorem 2.2.

Let $\hat{\boldsymbol{y}}_{cv}(T)=H_{cv}\boldsymbol{y}$ be a linear predictor of $\boldsymbol{y},$ and $\hat{y}_{te}(\boldsymbol{x}_{te},T_{tr})=\boldsymbol{h}_{te}\boldsymbol{y}_{tr}$ is its corresponding predictor of $y_{te}.$ Then:

[TABLE]

Proof.

Based on Eq. (5) and by assuming linear predictor,

[TABLE]

Therefore, it is only left to show that

[TABLE]

Since, by definition $y_{k}$ and $y_{te}$ were drawn from the same distortion, and $\boldsymbol{y}_{-k}$ and $\boldsymbol{y}_{tr}$ were drawn from the same distribution, then although $P_{T_{te},T_{tr}}\neq P_{T_{k},T_{-k}},$ still

[TABLE]

Therefore Eq. (10) holds. Similarly with Eq. (11). ∎

Using Theorem 2.2, given $\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}$ and $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)},$ an estimator of the generalization error for a linear predictor is:

[TABLE]

A special case is when $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}=0.$ For example, in the clustered LMM setting that was given in Eq. (8), $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}=0$ when the latent variable realizations of $\boldsymbol{y}_{tr}$ and $y_{te},$ i.e., $\boldsymbol{s}_{0}$ and $\boldsymbol{s}_{te},$ are independent. In this case:

[TABLE]

In case $\hat{\boldsymbol{y}}_{cv}(T)$ is not a linear predictor and Theorem 2.1’s condition does not hold, then there is no closed form for $w_{cv}.$ However, a correction is required as CV is not an unbiased estimator of the prediction error in this case. See Section 2.6 for some ad-hoc solutions that may apply in specific cases.

It is important to note that:

•

Theorem 2.2, as well as Eqs. (12) and (13), are valid for any K-fold setting and not only for LOO. Of course, $H_{cv}$ is based on $K$ and should be adjusted correspondingly.

•

The correction term in $\widehat{CV_{c}}$ depends on covariance matrices, which are commonly estimated. The effect of using estimated covariance matrices instead of the true covariance matrices is discussed and analyzed in Sections 2.7 and 3.

In case there are several alternative models, it is common to use their estimated prediction errors for selecting the best model.

Definition 2.3.

Given a set of models $\mathcal{H},$ the best $\widehat{CV_{c}}$ model is:

[TABLE]

where $\widehat{CV_{c}}(h)$ is $\widehat{CV_{c}}$ for model $h.$

2.4.1 Specifying $\boldsymbol{h}_{te}$ and $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}$

The term $n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}$ in Eq. (12) is relevant when $\boldsymbol{y}_{tr}$ and $y_{te}$ are correlated (but not in the same way as $y_{k}$ and $\boldsymbol{y}_{-k})$ e.g., assuming $T=\{X,\boldsymbol{y}\}$ has a hierarchical clustering structure:

[TABLE]

Assume that the prediction goal is to estimate:

[TABLE]

where $u_{te}\in[u_{1},...,u_{I}],$ however $b_{te}$ does not depend on the realizations of $\{b_{i,j}\}_{i\in\{1,...I\},\;j\in\{1,...,J\}}.$ In words, the prediction goal is to predict a new observation that relates to one of the high-level clusters (indexed by $i$ ), but from a new low-level cluster (indexed by $j$ ).

In this setting, since the realization of $u_{te}$ appears in $\boldsymbol{y},$ then $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})\neq 0$ and $n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}$ is required for calculating $\widehat{CV_{c}}.$

Although $T_{te}$ is an abstract object, $\boldsymbol{h}_{te}$ and $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})$ can be extracted from the data. Since $\boldsymbol{h}_{te}$ is distributed as $\boldsymbol{h}_{k}$ (for random k), then $\boldsymbol{h}_{te}$ can be extracted by selecting a random $\boldsymbol{h}_{k}$ from $H_{cv}.$ $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})$ is extracted correspondingly as $\mathrm{Cov}(\boldsymbol{y}_{-k},y_{k}|\boldsymbol{b},\boldsymbol{\epsilon}),$ where $\boldsymbol{b}$ and $\boldsymbol{\epsilon}$ are vectors containing $\{b_{i,j}\}_{i\in\{1,...I\},\;j\in\{1,...,J\}}$ and $\{\epsilon_{i,j,r}\}_{i\in\{1,...I\},\;j\in\{1,...,J\},\;r\in\{1,...,R\}},$ respectively. By conditioning on these vectors, the remaining covariance stems from $\boldsymbol{u}=[u_{1},...,u_{I}]$ only, corresponding to the covariance between $\boldsymbol{y}_{tr}$ and $y_{te}.$

Simpler, $n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}$ can be replaced by $\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y}|\boldsymbol{b},\boldsymbol{\epsilon})\big{)},$ such that:

[TABLE]

2.4.2 Interpretation of the Results

The correction $2/n\times\Big{[}\mathrm{tr}\Big{(}H_{cv}\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}-n\boldsymbol{h}_{te}\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}\Big{]}$ is intuitive since it expresses the difference between the correlation structure of the target prediction problem and the correlation structure in the available dataset, $T.$

Theorems 2.1 and 2.2 emphasize that the question whether CV is biased relates to the distributional setting and it is indifferent to the implemented algorithm. The implemented algorithm is expressed only by the values of $H_{cv}$ in $\widehat{CV_{c}}.$

An interesting example that stresses this understanding is when generalized least squares (GLS) is implemented in a use case with a correlation setting of $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}=0.$ Since GLS estimates $\mathbb{E}(\boldsymbol{y}|X)$ and does not utilize explicitly the random effects for achieving better prediction accuracy — as LMM does by estimating $\mathbb{E}(\boldsymbol{y}|X,\boldsymbol{s})$ — then one may think that CV is unbiased when GLS is implemented, regardless of the correlation structure of $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}.$ However, as was mentioned, this is wrong since the CV biasedness relates to the distributional setting rather than to the implemented algorithm. The bias in this case is:

[TABLE]

If $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})=\sigma^{2}_{\epsilon}I+\rho\mathbbm{1}_{n\times n},$ where $\mathbbm{1}_{n\times n}$ is a $n$ by $n$ matrix of ones and $\rho\in\mathbb{R}^{+},$ then:

[TABLE]

where the identity $\mathrm{Cov}\big{(}\boldsymbol{y}_{-k},\boldsymbol{y}_{-k}\big{)}^{-1}\mathrm{Cov}\big{(}\boldsymbol{y}_{-k},y_{k}\big{)}=\mathbbm{1}_{n-1}\rho/\Big{(}\sigma^{2}_{\epsilon}+\rho\big{(}n-1\big{)}\Big{)}$ is based on the result by Miller (1981).

The intuition behind the biasedness of CV in this setting is that although GLS does not utilizes explicitly estimated random effect realizations, its estimated model coefficients are still affected by the random effect realizations in $T.$ Similarly, if GLS was replaced by OLS, CV would still be biased under this correlation setting.

It is also important to emphasize that Theorem 2.1 derives $w_{cv}$ explicitly ( $w_{cv}=0$ ) for any model under its assumed conditions, however $\widehat{CV_{c}}$ uses an approximated $w_{cv},$ which is only relevant for linear models.

2.4.3 Comparison with Expected Optimism

Below, a comparison between the correction in $\widehat{CV_{c}}$ and the expected optimism correction (Efron 1986) is presented.

Expected optimism correction was developed in a context of in-sample prediction error measure:

[TABLE]

where $\boldsymbol{y}^{*}\in\mathbb{R}^{n}$ is identically distributed but independent copy of $\boldsymbol{y}.$ The in-sample prediction error is estimated by

[TABLE]

where $w$ is the expected optimism:

[TABLE]

If $\hat{\boldsymbol{y}}(X;T)=H\boldsymbol{y},$ for some hat matrix $H,$ then

[TABLE]

The similarity between $w$ and the correction in $\widehat{CV_{c}}$ in case $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}=0,$ i.e., in case $w_{cv}=\mathbb{E}\Big{(}2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})\big{)}\Big{)},$ is interesting since it reflects the relation between generalization error and in-sample error and emphasizes the role of the linearity in this relation.

The fundamental difference between in-sample error and generalization error is that in the latter, the covariates matrices, $X_{tr},\;\boldsymbol{x}_{te}$ are assumed to be random variables and therefore in generalization error, unlike in in-sample prediction error:

$X_{tr}$ and $\boldsymbol{x}_{te}$ are not identical. 2. 2.

An expectation is taken also over $\{X_{tr},\;\boldsymbol{x}_{te}\}.$

As was mentioned in the previous sections, the inner-sampling mechanism of $\{T_{k},T_{-k}\}$ in CV that emulates repeated sampling of $\{T_{te},T_{tr}\}$ from $\{P_{\boldsymbol{x}},P_{y|\boldsymbol{x},\boldsymbol{s}}\}$ expresses these properties. These properties are also reflected in the correction. Since

[TABLE]

then $2/n\times\mathrm{tr}\Big{(}H_{cv}\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}$ averages $n$ identically distributed atoms, $\boldsymbol{h}_{k}\mathrm{Cov}(y_{k},\boldsymbol{y}_{-k}),$ where each one of them is an unbiased estimator of $w_{cv}.$ Unlike in $w,$ which relates to a specific covariates matrix realization, $X,$ the atoms in $2/n\times\mathrm{tr}\Big{(}H_{cv}\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}$ relate to different covariates realizations, $\{X_{-k},\boldsymbol{x}_{k}\}_{k=1}^{n}.$

In addition, when $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})=\sigma^{2}_{\epsilon}I,$ while $w_{cv}=0,$ $w=2\sigma^{2}_{\epsilon}/n\times\mathrm{tr}(H).$ This is since in CV the sample is partitioned into training and test, however in expected optimism approach the whole sample is used for both tasks — training and test.

2.5 Advanced Correlation Settings

2.5.1 Kriging

Many applications with spatial and temporal data are analyzed using random functions framework, rather than multivariate random variable framework. A comprehensive review about random functions data analysis can be found in Wang et al. (2016). Here we focus on interpreting the results from Sections 2.3 and 2.4 in the context of a specific use case in the random functions framework — Kriging with Gaussian process regression (Rasmussen and Williams 2006, GPR). Numerical analysis of $\widehat{CV_{c}}$ implementation in real spatial dataset is presented in Section 3.2.2.

In Kriging (Goovaerts 1999) the goal is to create a climate map on some surface, $\mathbb{A},$ using climate predictions at a high-resolution grid of the surface. The predictions at the grid points are based on a predictive model that was fitted to a sample, $T,$ that was drawn from this surface, but covers the surface sparsely. In many cases, GPR is the predictive modeling method that is used for Kriging. In this method, as well as in other functional data analysis methods, the mean and the covariance of the predicted variable are formulated as functions. The mean function typically depends on fixed effects of some covariates (like elevation). The estimated mean function in GPR is a linear function of $\boldsymbol{y}.$ The covariance function, which is termed the kernel function, $\mathcal{K_{\mathbb{A}}}:(\mathbb{A}\times\mathbb{A})\to\mathbb{R},$ measures the covariance between each two points in $\mathbb{A},$ whether the points are in the sample $T$ or not. Unlike in the multivariate approach, where the correlation is induced by a latent random variable realization, in the functional approach the correlation structure of the surface $\mathbb{A},$ as it is expressed by $\mathcal{K}_{\mathbb{A}},$ is induced by realization of a stochastic process instance — latent random function, $\boldsymbol{s}.$ Since Theorems 2.1 and 2.2 are based on the relation between $P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}$ and $P_{y_{te}|\boldsymbol{x}_{te},T_{-k}},$ rather than whether the source of the correlation between the observations is a latent random variable or latent random function, these theorems can also be applied here.

Let us consider three scenarios. The first scenario is the classical Kriging use case, where the observations of both samples, $T_{te}$ and $T,$ are randomly sampled from the same surface, $\mathbb{A},$ i.e., observations of both samples are drawn independently from $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x},\boldsymbol{s}=\boldsymbol{s}_{0}}\},$ where $\boldsymbol{s}_{0}$ is the realization of the latent random function $\boldsymbol{s}$ in the surface $\mathbb{A}.$ In this case Theorem 2.1’s condition is satisfied and therefore $w_{cv}=0$ and CV is suitable.

The second scenario is when the realization of $\boldsymbol{s}$ is not the same in $T_{te}$ and $T,$ and therefore while the observations of $T$ follow $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x},\boldsymbol{s}=\boldsymbol{s}_{0}}\},$ the observation in $T_{te}$ follows $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x},\boldsymbol{s}=\boldsymbol{s}_{te}}\}.$ In this case Theorem 2.1’s condition is not satisfied. An example for this scenario is when $T_{te}$ is sampled from the same surface as $T,$ $\mathbb{A},$ however at a future time-point (e.g., when the goal it to create a climate map for the next year based on this year’s data). In this case, $y_{k}$ and $\boldsymbol{y}_{-k},$ which are sampled at the same time point, are more correlated than $y_{te}$ and $\boldsymbol{y}_{-k},$ which are sampled at different time-points. Therefore, CV is biased. As we saw in Section 2.4, if a linear model (such as GPR) is used, then $\widehat{CV_{c}}$ is an unbiased estimator of the generalization error and therefore should be used instead of CV.

Another spatial application in this scenario is when $T_{te}$ is sampled from the surface $\mathbb{A}^{\prime},$ which is different than $\mathbb{A}.$ Since the surfaces are different, then their latent random function realizations are different. Therefore, assuming observations in both samples, $T$ and $T_{te},$ were drawn from the same marginal distribution — $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x}}\}$ —, then $\widehat{CV_{c}}$ should be used instead of CV.

Another interesting scenario that is not covered either by Theorem 2.1 or by Theorem 2.2 is when $\{y_{k},\boldsymbol{x}_{k}\}_{k=1}^{n}$ and $\{y_{te},\boldsymbol{x}_{te}\}$ are drawn from different marginal distributions — $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x}}\}$ and $\{P^{te}_{\boldsymbol{x}},\;P^{te}_{y|\boldsymbol{x}}\},$ respectively. An example for this scenario is when Kriging is used for predicting extrapolated spatial points with respect to the sample $T.$ It may happen due to sampling challenges, such as sampling from mountainous and deep marine regions (Rabinowicz and Rosset 2020). This scenario, which violates the setting that is assumed in Theorems 2.1 and 2.2 requires further research.

2.5.2 Longitudinal Data

Another common setting with correlation structure is longitudinal data — where there are several subjects that are repeatedly observed over time. In this setting, due to the temporal orientation, the correlation structure is more complicated than in a simple clustered data.

Let us consider three scenarios that are equivalent to the three scenarios given in Section 2.5.1. However, unlike in Section 2.5.1, the multivariate framework would be considered (rather than the functional framework).

The first scenario is when the prediction goal is to predict a new observation, $T_{te},$ of one of the subjects in $T,$ sampled at a random time-point from the same distribution that the time-points of the observations in $T$ follow. Since $T_{te}$ and $T$ are sampled from the same subjects, then $\boldsymbol{s}_{te}=\boldsymbol{s}_{0}$ and therefore $P_{y_{te}|\boldsymbol{x}_{te},T_{-k}}=P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}.$ By Theorem 2.1 this gives $w_{cv}=0,$ and therefore CV is suitable in this case.

The second scenario is when the prediction goal is predicting a new observation that relates to a subject that is not in $T$ and therefore $\boldsymbol{s}_{te}\neq\boldsymbol{s}_{0}$ and $P_{y_{te}|\boldsymbol{x}_{te},T_{-k}}\neq P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}.$ For this scenario, given that $\{y_{k},\boldsymbol{x}_{k}\}$ and $\{y_{te},\boldsymbol{x}_{te}\}$ were sampled from the same marginal distribution, $\{P_{\boldsymbol{x}},\;P_{y|\boldsymbol{x}}\},$ and a linear model is implemented, then by Theorem 2.2, $\widehat{CV_{c}}$ should be used instead of CV. Numerical analysis of $\widehat{CV_{c}}$ implementation in this scenario is presented in Section 3.1.

Another scenario is when $\{y_{k},\boldsymbol{x}_{k}\}$ and $\{y_{te},\boldsymbol{x}_{te}\}$ are sampled from different marginal distributions, $\{P_{\boldsymbol{x}},P_{y|\boldsymbol{x}}\}$ and $\{P^{te}_{\boldsymbol{x}},\;P^{te}_{y|\boldsymbol{x}}\},$ respectively. The marginal distributions can be different due to various reasonable prediction goals, such as forcing the data point in $T_{te}$ to extrapolates the data points in $T$ with respect to the time variable, which results in $\boldsymbol{x}_{te}$ and $\boldsymbol{x}_{k}$ being nonidentically distributed. This scenario violates the assumptions in Theorems 2.1 and 2.2 and therefore requires further research.

2.6 Comparison with Other Methods

Several cross-validation variants were proposed for settings involving correlated data. Some of them were proposed from a perspective that correlation between the folds causes K-fold CV to underestimate the generalization error. As was shown above, this perception is wrong in many scenarios. Other variants are relevant for very specific applications under various sampling restrictions. Below, several cross-validation variants are described and compared to $\widehat{CV_{c}}.$

One method is h-blocking (Burman et al. 1994), which is mainly relevant for spatial data. In h-blocking, in order to reduce the correlation between the folds, the analyzed surface is partitioned into blocks (folds) that are separated from each other by some distance, $h.$ As was described above, many use cases do not require any correlation reduction between the folds and K-fold CV is suitable, however the use of h-blocking is often recommended regardless of the predictive problems setup (Roberts et al. 2017). Let us focus on a scenario when $P_{y_{te}|\boldsymbol{x}_{te},T_{-k}}\neq P_{y_{k}|\boldsymbol{x}_{k},T_{-k}},$ and therefore the condition in Theorem 2.1 is not satisfied, causing K-fold CV to be biased. In this scenario, although the h-blocking approach may seem reasonable, in fact, it suffers from several issues that do not affect $\widehat{CV_{c}}.$ For example, frequently, creating the separation between the folds requires omitting observations from the training sample. In addition, the folds that are generated by h-blocking have different distributions, in particular their distributions are different than $P_{T_{te}}.$ Therefore, h-blocking may provide a biased prediction error estimator with respect to the generalization error. Moreover, since some of the blocks are at the edge of the surface, then the prediction of those blocks becomes predicting spatial extrapolation, which might be inaccurate and does not reflect the planned prediction problem. This implication can affect dramatically the prediction error estimate (Roberts et al. 2017).

Another method is leave cluster out (Rice and Silverman 1991, LCO). This method is relevant for the case when $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})=0$ and the training set, $T,$ has a clustered correlation structure. LCO eliminates the correlation between the folds by defining each cluster as a fold. This method suffers from several challenges that do not appear in $\widehat{CV_{c}}.$ First, using LCO forces the number of folds to be equal to the number of clusters. Another issue is when different clusters contain a substantially different number of observations, in which case LCO prediction error estimator can be biased with respect to the generalization error. In addition, validity of LCO is challenged when some clusters have different distribution than other clusters. In this case, as opposed to the generalization error definition, the observations in $X_{k}$ and $X_{-k}$ are nonidentically distributed.

Another cross-validation variant that is relevant for a balanced longitudinal data setting is leave observation from each cluster out (Wu and Zhang 2002, LOFCO). This method is relevant for the case when the goal is to predict a new observation for each one of the subjects that appear in the training set. The folds partitioning mechanism in LOFCO is that different folds refer to different time-points, such that each fold contains the observations that were collocated at the same time-point across all the subjects. This partitioning is feasible due to the balanced data design assumption. The challenges in this method are similar to the challenges mentioned above: it requires a balanced data design, the number of folds are forced by the data structure, $X_{k}$ and $X_{-k}$ are not identically distributed — in particular their time-points covariate is nonidentically distributed.

LCO, LOFCO and h-blocking reflect the understanding that nonstandard CV may be needed in presence of correlations, and offer solutions for very specific types of datasets and correlation settings that apply to any modeling technique. In contrast, $\widehat{CV_{c}}$ can be applied in a wide range of types of datasets and correlation settings, however it is limited to linear models.

One last estimator that should be compared to $\widehat{CV_{c}}$ is

[TABLE]

where $H$ satisfies, $\hat{\boldsymbol{y}}(X;T)=H\boldsymbol{y}.$ This estimator has the same spirit as generalized CV (Craven and Wahba 1978, GCV) which approximates LOO for i.i.d. data as follows:

[TABLE]

Eq. (14) was first proposed by Altman (1990) in context of time-series and later by Opsomer et al. (2001) in context of spatial data analysis. Altman’s motivating application was selecting the best bandwidth for modeling the trend in the data. He argues that replacing $\mathrm{tr}\big{(}H\big{)}$ in GCV by $\mathrm{tr}\Big{(}H\mathrm{Cov}\big{(}\boldsymbol{y},\boldsymbol{y}\big{)}\Big{)}$ denoises the correlation between observations. Although Eq. (14) and $\widehat{CV_{c}}$ are different and developed for different applications, they share some similarity in suggesting a corrected version for CV rather than controlling the partitioning scheme. Analyzing numerically Eq. (14) did not yield promising results, in fact its approximated expectation did not converge to the same scale as the approximated generalization error, CV and $\widehat{CV_{c}}.$ We conclude that the heuristic argument behind the derivation of Eq. (14) as an estimator of the generalization error does not hold in practical cases where the rigorous $\widehat{CV_{c}}$ gives valid results.

2.7 Estimating $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})$

In most applications $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})$ is unknown in advance and therefore in order to implement $\widehat{CV_{c}},$ it should be estimated. In many cases, such as when GLS is implemented, the covariance matrix is already estimated in the model fitting part and can be also used in $\widehat{CV_{c}}.$ The use of estimated covariance matrix instead of the true covariance matrix can add variance to $\widehat{CV_{c}}$ and even can make $\widehat{CV_{c}}$ biased, especially when the estimation of the covariance matrix is imprecise, due to small sample size or other reasons. Several papers analyze the effect of plugging in the estimated covariance matrix instead of the true one in context of prediction error estimation, in various CV versions (Altman 1990; Francisco-Fernandez and Opsomer 2005) or AIC-type versions (Vaida and Blanchard 2005; Liang et al. 2008; Greven and Kneib 2010; Rabinowicz and Rosset 2020). Some papers derive extra corrections for their prediction error estimators which are theoretically valid in certain use cases, but are not applicable due to computational complexity (Liang et al. 2008), others show numerically that the prediction error estimator that are based on the estimated covariance matrices instead of the true covariance matrices performs well. Using simulations and real data analysis, we show in Section 3, that the effect of using the estimated covariance matrices in $\widehat{CV_{c}}$ on its bias and variance is small, especially when the sample size of the training set is not very small. These results support the results of Altman (1990); Vaida and Blanchard (2005); Francisco-Fernandez and Opsomer (2005).

3 NUMERICAL RESULTS

This section compares $\widehat{CV_{c}}$ and CV, with respect to the approximated generalization error, using simulation and real datasets analyses. Different prediction goals and correlation structures are analyzed. Relevant datasets and code can be found in [https://github.com

/AssafRab/CVc](https://github.com/AssafRab/CVc).

3.1 Simulation

The dependent variable, $\boldsymbol{y}\in\mathbb{R}^{n},$ was sampled from the following model:

[TABLE]

where

•

the random effects, $u_{i},\;\boldsymbol{b}_{i,j}=[b_{1},b_{2}]$ and $\epsilon_{i,j,k}$ are independent and distributed as follows:

[TABLE]

•

the covariates, $\{\boldsymbol{x}_{r}\}_{r=1}^{9}$ are:

–

$x_{i,j,k,1}=1,\;x_{i,j,k,2}=k,\;\forall i,j,k$ are the intercept and the time covariates,

–

$x_{i,j,k,r}=\eta_{i}+\delta_{i,j,k},\;\forall i,j,k$ and $\forall r\in\{3,...,9\},$ where $\eta_{i}\overset{ind}{\sim}N(0,1),\;\delta_{i,j,k}\overset{ind}{\sim}N(0,1)$ are independent,

•

$z_{i,j,k,1}=1,\;z_{i,j,k,2}=k\;\forall i,j,k$ are the covariates for random intercept and random slope.

This settings was simulated for three different sample sizes, $n=300/400/500$ (where $I$ varies respectively to $6/8/10$ ).

This is a hierarchical clustered structured with $I$ clusters of $50$ observations each, where within each cluster there are five subclusters of ten observations each. There is a random intercept for the high-level clusters and random intercept and slope for the subclusters. Therefore the covariance of $\boldsymbol{y}|\boldsymbol{x}_{1},...,\boldsymbol{x}_{9}$ is:

[TABLE]

GLS was fitted using LOO for eight nested models. Model $1$ contains the intercept and time, Model $2$ also contains $\boldsymbol{x}_{3},$ and so on. Model $8$ contains all the covariates.

3.1.1 Estimating Prediction Error

$T_{tr}$ is a random subset of $T$ of size $n-1.$ $T_{te}$ is a single observation, drawn from the same marginal distribution, however with new independent realizations of all the random effects. Therefore, when implementing CV, while in $T:$

[TABLE]

in $\{T_{tr},T_{te}\}:$

[TABLE]

The generalization error was approximated by averaging $\big{(}y_{te}-\hat{y}(\boldsymbol{x}_{te};T_{tr})\big{)}^{2},$ based on $1000\times n$ samples of $\{T_{te},T_{tr}\}.$ The densities of CV and $\widehat{CV_{c}}$ were approximated based on $1000$ samples of $T.$ Since in this setting $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})=0,$ the correction in this case is $2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})\big{)}.$

Figure 1 shows the distribution of CV and $\widehat{CV_{c}}$ including their means for the saturated model when $n=400,$ compared to the generalization error. Two versions of CV and $\widehat{CV_{c}},$ are presented — when the variance parameters are known, and when they are estimated.

As can be seen, $\widehat{CV_{c}}$ is an unbiased estimator of the generalization error, while CV estimator is biased. This is true for both versions, when the variance parameters are known and when they are estimated. Also, as expected, estimating the variance parameters increases the variance of $\widehat{CV_{c}}.$ Still, the density of the $\widehat{CV_{c}}$ version with the estimated variance parameters is similar to the version with the known variance parameters — their averages are $60.14$ and $60.09$ (compared to $60.00$ for the approximated generalization error), their standard deviations are $12.07$ and $13.02,$ respectively.

In order to asses the performance of $\widehat{CV_{c}}$ version with the estimated variance parameters, compared to the version with the known variance parameters, a two sample Anderson-Darling test (Anderson and Darling 1952) was used. The tested statistic is:

[TABLE]

where one sample uses $\widehat{CV_{c}}$ version with the estimated variance parameters, and the other sample uses $\widehat{CV_{c}}$ version with the true variance parameters. Implementing the function ad.test of the package kSamples in R software, the p-value of the test is $0.33.$ The result indicates that in this setting there is no evidence for significant difference between the distribution of $\widehat{CV_{c}}-generalization\;error$ when the variance parameters are known in advance or estimated.

Figure 2 shows the same analysis as in Figure 1, however for various sample sizes.

As can be seen in Figure 2, the variance of the two $\widehat{CV_{c}}$ versions becomes similar as the sample size increases. While for $n=300$ the standard deviations of the version with the true variance parameters is $14.43$ and for the estimated is $16.04,$ when $n=500$ they are $11.25$ and $11.93,$ respectively. This is expected, since larger sample sizes provide more accurate variance parameters estimators. Also, as the sample size increases the CV bias, $w_{cv},$ decreases. This is specific to our setting, where $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})$ becomes more sparse as the sample size increases and therefore $\mathbb{E}\big{[}2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})\big{)}\big{]}$ decreases.

3.1.2 $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})\neq 0$ and Model Selection

In this simulation $\widehat{CV_{c}}$ is analyzed under a different prediction goal than in Section 3.1.1: estimating the generalization error of new observations from the same high-level clusters, but from a different subcluster, i.e.,

[TABLE]

where $u_{i}$ is a random effect realization that appears in the training sample, and $\boldsymbol{b}_{te}=[b_{te,1},b_{te,2}]^{t}$ is a new random effect realization vector. The new prediction goal derives a different correction — since in this setting $y_{te}$ is correlated with $\boldsymbol{y}_{tr}$ through the random effect $u,$ then the correction in this case is $2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y}|u_{1},u_{2}...,u_{I})\big{)},$ for more details see Section 2.4.1.

Two different predictive algorithms were implemented:

•

LMM, which is the model that should be used in this setting under the normality assumption, since it utilizes explicitly the correlation between $y_{te}$ and $\boldsymbol{y},$ using the relevant estimated random effects realizations, $\hat{u}_{1},...,\hat{u}_{I}.$

•

GLS, which is inferior to LMM under the normality assumption, however unlike LMM, GLS is relevant for use cases where distributional assumptions cannot be taken. This is since GLS can be interpreted as an extension of the least squares algorithm, which does not rely on distributional assumptions.

Figure 3 shows the model density estimation of $\widehat{CV_{c}}$ and CV compared to the approximated generalization error in the new setting when LMM and GLS are implemented.

As can be seen in Figure 3, $\widehat{CV_{c}}$ estimates the generalization error unbiasedly, while CV underestimates it. The CV bias in Figure 3(b) is relatively small with respect to the biases in Figure 3(a) and Figure 2:

•

The reason that the bias in Figure 3(b) is smaller than the bias in Figure 2 although in both settings GLS is implemented, is that the deviation of $P_{y_{te}|\boldsymbol{x}_{te},T_{tr}}$ from $P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}$ is smaller in the setting of Figure 3(b), as expressed in the corrections — $2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y}|u_{1},u_{2}...,u_{I})\big{)}$ in the setting of Figure 3(b) and $2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})\big{)}$ in the setting of Figure 2.

•

The reason that the bias in Figure 3(b) is smaller than the bias in Figure 3(a) although the deviation of $P_{y_{te}|\boldsymbol{x}_{te},T_{tr}}$ from $P_{y_{k}|\boldsymbol{x}_{k},T_{-k}}$ is the same in both settings is that in Figure 3(a) LMM is implemented and therefore the correlation setting is utilized more than in Figure 3(b). While in GLS the realization of $\boldsymbol{b}$ affects the fixed effects estimates (see Section 2.4.2), in LMM the realization of $\boldsymbol{b}$ affects both, the fixed effects estimates and the estimates of $u_{1},...,u_{I}.$

For demonstrating the model selection performance, Figure 4 presents the agreement rates of $\widehat{CV_{c}}$ and CV with the approximated generalization error, over the repeated simulation runs. The results are presented for different sample sizes and for different predictive models (LMM and GLS).

As can be seen from Figures 4, when LMM is implemented $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV_{c}}(h)$ performs better than $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV}(h),$ however when GLS is implemented both criteria perform the same. It is important to note that in order to have a high agreement rate with the oracle, estimating unbiasedly the generalization error, as $\widehat{CV_{c}}$ does, is not enough since the variance of the estimator can mix the ranks of the models relative to the oracle. Therefore, there is no guarantee that $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV_{c}}(h)$ performs better than $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV}(h).$ This is also applies to other model selection criteria that are based on comparing the models’ estimated prediction errors. However, in practice, as we see here and in the real dataset examples in the next section, $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV_{c}}(h)$ performs better than $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV}(h)$ when the CV bias is large.

In addition, for LMM, both $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV_{c}}(h)$ and $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV}(h)$ with the true variance parameters perform better than the versions with the estimated variance parameters, while for GLS both perform similarly. This can be explained by the key role of the variance in predicting using LMM, compared to predicting using GLS. Also, when the sample size is small ( $n=300$ ), it is more difficult to estimate accurately the variance parameters, and the difference between the two $\underset{h\in\mathcal{H}}{\mathrm{argmin}}\;\widehat{CV_{c}}(h)$ versions is substantial.

3.2 Real Data Analysis

This section presents analysis of two real datasets, a dataset with a clustered correlation structure, and a dataset with hierarchical spatial correlation structure.

3.2.1 Black Friday Dataset — Clustered Correlation Structure

The Black Friday dataset222Presented by Analytics Vidhya. contains information of $737,577$ transactions made in a retail store on Black Friday by $5,891$ customers. The median number of transactions by a customer is $53,$ and the range is $[5,1025].$ The dataset is available in [https://github.com/

AssafRab/CVc](https://github.com/AssafRab/CVc).

A training sample of all transactions from $100$ random customers was drawn and the goal is to fit a predictive model that predicts the purchase amount of a new transaction of a new random customer. Therefore, while in CV setting $\mathrm{Cov}(\boldsymbol{y}_{-k},y_{k})\neq 0,$ in the prediction goal setting $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})=0.$ Therefore the condition of Theorem 2.1 is not satisfied $(P_{T_{te},T_{tr}}\neq P_{T_{k},T_{-k}})$ and K-fold CV is unsuitable.

Three GLS models were fitted using different covariates (see Table 6) with the covariance matrix $\mathrm{Cov}(y_{i,j},y_{i^{\prime},j^{\prime}})=\sigma^{2}_{customer}1_{[i=i^{\prime}]}+\sigma^{2}_{\epsilon}1_{[i=i^{\prime},j=j^{\prime}]},$ where $i$ is customer index and $j$ is the index for an observation in a customer’s set of observations. $\sigma^{2}_{customer}$ and $\sigma^{2}_{\epsilon}$ were estimated using restricted maximum likelihood of normal distribution.

CV and $\widehat{CV_{c}}$ were calculated using the training sample. Test error was calculated by averaging the prediction error of observations of other random $900$ customers using GLS models that were fitted by $T.$ This procedure was repeated $50$ times with different random training and test sets, and the generalization error was approximated by averaging the test errors of the $50$ runs.

Figure 6 presents the means of CV, $\widehat{CV_{c}},$ and the test error over the $50$ runs, as well as the confidence intervals of the means.

As can be seen in Figure 6, $\widehat{CV_{c}}$ estimates the generalization error better than CV for all the three models. Also, unlike CV, $\widehat{CV_{c}}$ selects the same model as the oracle, Model 1, and follows properly the prediction error estimation trend across the models. Analyzing the business aspect of this example, leads to the surprising conclusion that including only the product category as a fixed effect gives the best prediction model, and adding covariates like age and gender does not improve predictive performance, when the data are properly analyzed using $\widehat{CV_{c}}.$

3.2.2 California Housing Dataset — Clustered Random Field Correlation Structure

Another dataset that was analyzed is the California housing dataset, which contains house values and other housing parameters in California. The dataset has a hierarchical spatial correlation structure — each observation has spatial coordinates value, where some of the observations share the same coordinates value and therefore define a cluster. This correlation structure is similar to the correlation structure in Sections 3.1, however with spatial data rather than longitudinal data. The dataset is available in the python scikit-learn package and the code is available in https://github.com/AssafRab/CVc.

To emphasize the hierarchical clustering structure, all the clusters containing only a single observation were excluded from the analysis ( $8,237$ observations out of $20,640$ ), so the analyzed dataset contains $12,403$ observations from $4,353$ clusters. The median number of observations in a cluster is $2$ and the range is $[2,15].$

A training sample, $T,$ of $700$ independent clusters was randomly drawn. The prediction goal is to predict the ’median house value in a block’ of a new observation from a different cluster than the clusters in $T.$ Three GPR models with different covariates were fitted (see Table 8) with the following covariance function:

[TABLE]

where $\boldsymbol{z}_{i}\in\mathbb{R}^{2}$ contains the latitude and longitude of observation $i$ and $K_{exp}(\cdot)$ is the exponential kernel function. The parameters of $K_{exp}(\cdot)$ and $\sigma^{2}$ were fitted using maximum likelihood of normal distribution.

Since the prediction goal is to predict a new observation from a different cluster than the clusters in $T,$ then $\mathrm{Cov}(\boldsymbol{y}_{-k},y_{k})\neq\mathrm{Cov}(\boldsymbol{y}_{-k},y_{te}).$ Therefore the condition of Theorem 2.1 is not satisfied and standard K-fold CV is unsuitable. It is important to note that unlike the Black Friday example, here $\mathrm{Cov}(\boldsymbol{y}_{-k},y_{te})\neq 0$ although $y_{te}$ is from a new cluster. This is due to the spatial correlation structure of the dataset, i.e., the latent random function $\boldsymbol{s}.$ The $\widehat{CV_{c}}$ correction in this case is $2/n\times\mathrm{tr}\big{(}H_{cv}\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y}|\boldsymbol{s})\big{)},$ where $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y}|\boldsymbol{s})$ , which expresses the extra correlation that the clusters contribute over the spatial correlation of the data (see Section 2.4.1), is estimated as follows:

[TABLE]

The analysis was repeated $50$ times with different random $T,$ at each run CV and $\widehat{CV_{c}}$ were calculated and the test error was calculated by averaging the prediction error of the remaining observations of the $3,653$ clusters that are not in $T$ using GPR models that were fitted by $T.$ Averaging the test error over the $50$ runs approximates the generalization error.

Figure 8 presents the means of $CV,$ and $\widehat{CV_{c}},$ and the test error over the $50$ runs, as well as the confidence intervals of the means.

Similarly to Figure 6, $\widehat{CV_{c}}$ estimates the generalization error better than CV for all the models, selects the same model as the oracle (Model 3), and follows properly the prediction error estimation trend across the models. Also, while in Figure 6 CV underestimates the prediction error of the saturated model more than the smaller models, here CV underestimates the saturated model less than the smaller models. Therefore, biasedness of CV prediction error can be expressed in model selection in different ways — sometimes by favoring over-parameterized models and sometimes by favoring under-parameterized models.

4 Conclusion and Discussion

In this paper we tackle the problem of applying CV as an estimate of generalization error in non-i.i.d. situations. While the fundamental concerns that this presents are widely acknowledged, a clear understanding of when adjustments are needed, and what adjustments are appropriate, seems lacking in much of the literature (Roberts et al. 2017; Saeb et al. 2017; Anderson et al. 2018).

We first present a general formulation of the bias in using CV in presence of correlations, which leads to a clear general definition of settings where no correction to CV is needed (Theorem 2.1). It shows that non-i.i.d. situations can still facilitate correctness of regular CV, as long as the dependence structure between training and prediction points is consistent in CV and actual prediction. This simple result appears to contradict some previous claims in the literature. An example can be found in Roberts et al. (2017), where mechanisms of controlling folds partitioning are suggested for Kriging tasks, based on the conception that CV is always over-optimistic for non-i.i.d. data.

We then present a derivation of a bias correction for linear models under general correlation structures (Theorem 2.2), which is used in establishing a bias corrected cross-validation version, $\widehat{CV_{c}}$ . To implement our correction, it is necessary to specify the covariance structures within the training set and between the training and prediction sets. This is typically also required to choose a modeling approach. For example, in a simple linear mixed model with normal assumptions, if one assumes that the random effect realizations are the same when predicting, then LMM prediction is appropriate, while if random effect realizations are new, then using GLS for estimation of fixed effects only for prediction is more appropriate (Verbeke 1997). However, it is important to emphasize that the validity of the correction $\widehat{CV_{c}}$ does not depend on selection of an appropriate modeling approach. In other words, if one mistakenly uses GLS where LMM is appropriate, $\widehat{CV_{c}}$ still gives an unbiased estimate of generalization error for the resulting model, as long as the covariance structure is correctly specified.

In practice, the covariance matrices are typically not fully known, but partially estimated from the data (for example, variance parameters in LMM can be estimated using restricted maximum likelihood, Verbeke (1997)), and this is also required for applying $\widehat{CV_{c}}$ to correct CV results. This could potentially add uncertainty to the estimates, but as demonstrated in Section 3, it does not tend to affect their expectation.

A fundamental assumption that is taken throughout this paper is that the marginal distributions of the training and prediction sets are the same. In case the marginal distributions are different, i.e., the observations in the training set are drawn from a different marginal distribution than the prediction data, then $\widehat{CV_{c}}$ is unsuitable. This scenario requires further research and relevant applications are given in Sections 2.5.1 and 2.5.2. Other important use cases of different marginal distributions for training and prediction sets are discussed in Rabinowicz and Rosset (2020) and solutions for these use cases are proposed in context of transductive prediction error (rather than generalization error approach taken here).

An interesting situation also covered by $\widehat{CV_{c}}$ and not often discussed, is when the training set contains i.i.d. observations, but new data points where predictions are made are actually correlated with the training set (for example, new observations in the same set of cities). In this case, if the correlation structure is known, then $\widehat{CV_{c}}$ can still be used to estimate the prediction error.

Bibliography41

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Altman (1990) Altman, N. S. (1990), “Kernel smoothing of data with correlated errors,” Journal of the American Statistical Association , 85, 749–759.
2Anderson et al. (2018) Anderson, C., Hafen, R., Sofrygin, O., Ryan, L., and HBG Dki Community (2018), “Comparing predictive abilities of longitudinal child growth models,” Statistics in Medicine , 38, 3555–3570.
3Anderson and Darling (1952) Anderson, T. W. and Darling, D. A. (1952), “Asymptotic theory of certain ”goodness of fit” criteria based on stochastic processes,” The Annals of Mathematical Statistics , 23, 193–212.
4Arlot et al. (2010) Arlot, S., Celisse, A., et al. (2010), “A survey of cross-validation procedures for model selection,” Statistics Surveys , 4, 40–79.
5Breiman (2001) Breiman, L. (2001), “Random forests,” Machine Learning , 45, 5–32.
6Breslow and Clayton (1993) Breslow, N. E. and Clayton, D. G. (1993), “Approximate inference in generalized linear mixed models,” Journal of the American statistical Association , 88, 9–25.
7Burman (1989) Burman, P. (1989), “A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods,” Biometrika , 76, 503–514.
8Burman et al. (1994) Burman, P., Chow, E., and Nolan, D. (1994), “A cross-validatory method for dependent data,” Biometrika , 81, 351–358.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Cross Validation for Correlated Data

Cross-Validation for Correlated Data

Abstract

1 INTRODUCTION

2 CV FOR CORRELATED DATA

2.1 K-fold Estimator and Generalization Error

Remark 2.1**.**

2.2 A General Formulation of CV Bias

2.3 Criterion for CV Unbiasedness

Theorem 2.1**.**

Proof.

2.4 CV Correction

Definition 2.1**.**

Definition 2.2**.**

Theorem 2.2**.**

Proof.

Definition 2.3**.**

2.4.1 Specifying hte\boldsymbol{h}_{te}hte​ and \mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}

2.4.2 Interpretation of the Results

2.4.3 Comparison with Expected Optimism

2.5 Advanced Correlation Settings

2.5.1 Kriging

2.5.2 Longitudinal Data

2.6 Comparison with Other Methods

2.7 Estimating Cov(y,y)\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})Cov(y,y)

3 NUMERICAL RESULTS

3.1 Simulation

3.1.1 Estimating Prediction Error

3.1.2 Cov(ytr,yte)≠0\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})\neq 0Cov(ytr​,yte​)=0 and Model Selection

3.2 Real Data Analysis

3.2.1 Black Friday Dataset — Clustered Correlation Structure

3.2.2 California Housing Dataset — Clustered Random Field Correlation Structure

4 Conclusion and Discussion

Remark 2.1.

Theorem 2.1.

Definition 2.1.

Definition 2.2.

Theorem 2.2.

Definition 2.3.

2.4.1 Specifying $\boldsymbol{h}_{te}$ and $\mathrm{Cov}\big{(}\boldsymbol{y}_{tr},y_{te}\big{)}$

2.7 Estimating $\mathrm{Cov}(\boldsymbol{y},\boldsymbol{y})$

3.1.2 $\mathrm{Cov}(\boldsymbol{y}_{tr},y_{te})\neq 0$ and Model Selection