Capturing Between-Tasks Covariance and Similarities Using Multivariate Linear Mixed Models

Aviv Navon; Saharon Rosset

arXiv:1812.03662·stat.ME·October 3, 2019


TL;DR

This paper introduces MrRCE, a multivariate linear mixed model approach that captures within-group coefficient similarities for multi-response prediction, outperforming existing methods in synthetic and real data scenarios.

Contribution

The paper proposes a novel multivariate linear mixed model estimator that directly models and estimates within-group coefficient similarities, improving prediction accuracy.

Findings

1. Outperforms competitors in synthetic data experiments.

2. Effective in real-world multi-response prediction tasks.

3. Encourages coefficients for the same variable to share signs and magnitudes.


Tables

Table 1: NYC Taxi Rides. Mean and standard deviation of the MSE, estimated over $K=26$ cutoffs.

Model	Mean	Std
MrRCE	3.85e-3	4.57e-3
Ridge	4.59e-3	5.34e-3
Sep. Ridge	4.59e-3	5.34e-3
MRCE	4.61e-3	5.12e-3
Group Lasso	5.68e-3	7.72e-3
Sep. Lasso	5.75e-3	7.12e-3
OLS	2.00e-2	1.40e-2

Table 2: Avocado Prices. Mean and standard deviation of the MSE, estimated over $K=10$ folds.

Model	Mean	Std
MrRCE	53.9e-2	22.6e-2
MRCE	63.4e-2	29.0e-2
Group Lasso	66.7e-2	29.9e-2
Sep. Ridge	71.0e-2	38.7e-2
Ridge	71.5e-2	39.8e-2
Sep. Lasso	72.0e-2	36.0e-2
OLS	73.1e-2	41.3e-2

Equations

$$Y=XB+E$$

$$\underset{B,\Omega}{\arg\min}\;\text{tr}\left[\frac{1}{n}\Omega\left(Y-XB\right)^{T}\left(Y-XB\right)\right]-\log\left|\Omega\right|+\lambda_{1}\left\|B\right\|_{1}+\lambda_{2}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|$$

$$Y=Z\Gamma+E$$

$$C=C_{\rho}=\begin{pmatrix}1&\rho&\cdots&\rho\\ \rho&1&\ddots&\vdots\\ \vdots&\ddots&\ddots&\rho\\ \rho&\cdots&\rho&1\end{pmatrix}$$

$$\mathcal{L}\left(Y,\Gamma;\Theta\right)=\mathcal{L}_{Y\mid\Gamma}\left(Y\mid\Gamma;\Omega\right)\mathcal{L}_{\Gamma}\left(\Gamma\mid\sigma^{2},\rho\right)$$

$$\ell\left(Y,\Gamma;\Theta\right)=\text{tr}\left[\frac{1}{n}\Omega\left(Y-Z\Gamma\right)^{T}\left(Y-Z\Gamma\right)\right]-\log\left|\Omega\right|+\text{tr}\left[\frac{1}{p}\Delta\Gamma^{T}\Gamma\right]-\log\left|\Delta\right|$$

$$\hat{\Theta}=\underset{\Theta}{\arg\min}\;\ell\left(Y,\Gamma;\Theta\right)+\lambda_{\omega}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|$$

$$C=UDU^{T}\quad\text{and}\quad ZZ^{T}=LSL^{T}$$

$$\tilde{Y}=\tilde{Z}\tilde{\Gamma}+\tilde{E}$$

$$E\sim MVN_{n\times q}\left(0,I_{n},\Sigma:=\Omega^{-1}\right),\qquad\Gamma\sim MVN_{p\times q}\left(0,I_{p},\sigma^{2}D_{\rho}\right)$$

$$Y\sim MVN_{n\times q}\left(0,S,\sigma^{2}D_{\rho}\right)+MVN_{n\times q}\left(0,I_{n},\Sigma\right)$$

$$\begin{pmatrix}\mathbf{g}\\ \mathbf{y}\end{pmatrix}\sim N\left(\mathbf{0},\begin{bmatrix}\Delta^{-1}\otimes AA^{T}&\Delta^{-1}\otimes AZ^{T}\\ \Delta^{-1}\otimes ZA^{T}&\Sigma\otimes I_{n}+\Delta^{-1}\otimes ZZ^{T}\end{bmatrix}:=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{bmatrix}\right)$$

$$\mathbf{g}\mid\mathbf{y}\sim N\left(\Sigma_{12}\Sigma_{22}^{-1}\mathbf{y},\;\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)$$

$$\mathbb{E}\left[G^{T}G\mid Y,\Theta_{t-1}\right]_{i,j}=\mathbb{E}\left[\mathbf{g}_{i}^{T}\mathbf{g}_{j}\mid\mathbf{y},\Theta_{t-1}\right]$$

$$\underset{\Omega\succeq 0}{\arg\min}\;\text{tr}\left[\frac{1}{n}\Omega Q_{t}^{1}\right]-\log\left|\Omega\right|+\lambda_{\omega}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|$$

$$\underset{\sigma>0,\;\rho\in[0,1)}{\arg\min}\;\text{tr}\left[\frac{1}{p}\Delta Q_{t}^{2}\right]-\log\left|\Delta\right|$$

$$\boldsymbol{\gamma}^{*}=\left(\tilde{Z}^{T}R^{-1}\tilde{Z}+L^{-1}\right)^{-1}\tilde{Z}^{T}R^{-1}\mathbf{y}$$

$$\mathbf{y}=\tilde{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon}$$

$$\boldsymbol{\epsilon}\sim N\left(0,\Sigma_{0}\otimes I_{n}:=\Sigma\right),\qquad\boldsymbol{\gamma}\sim N\left(0,\Lambda_{0}\otimes I_{p}:=\Lambda\right)$$

$$\begin{pmatrix}\boldsymbol{\gamma}\\ \mathbf{y}\end{pmatrix}\sim N\left(\mathbf{0},\begin{bmatrix}\Lambda&\Lambda\tilde{Z}^{T}\\ \tilde{Z}\Lambda&\tilde{Z}\Lambda\tilde{Z}^{T}+\Sigma\end{bmatrix}\right)$$

$$\hat{\boldsymbol{\gamma}}_{BLUP}=\Lambda\tilde{Z}^{T}\left(\tilde{Z}\Lambda\tilde{Z}^{T}+\Sigma\right)^{-1}\mathbf{y}$$

$$\hat{\boldsymbol{\gamma}}_{RR}=\left(\tilde{Z}^{T}\tilde{Z}+K\right)^{-1}\tilde{Z}^{T}\mathbf{y}$$


Code & Models

Repositories

AvivNavon/MrRCE (official)


Full text

Capturing Between-Tasks Covariance and Similarities Using Multivariate Linear Mixed Models

Aviv Navon

Department of Statistics and Operations Research

Tel-Aviv University

Tel-Aviv, Israel

[email protected]

Saharon Rosset

Department of Statistics and Operations Research

Tel-Aviv University

Tel-Aviv, Israel

[email protected]

Abstract

We consider the problem of predicting several response variables using the same set of explanatory variables. This setting naturally induces a group structure over the coefficient matrix, in which every explanatory variable corresponds to a set of related coefficients. Most of the existing methods that utilize this group formation assume that the similarities between related coefficients arise solely through a joint sparsity structure. In this paper, we propose a procedure for constructing an estimator of a multivariate regression coefficient matrix that directly models and captures the within-group similarities, by employing a multivariate linear mixed model formulation, with a joint estimation of covariance matrices for coefficients and errors via penalized likelihood. Our approach, which we term Multivariate random Regression with Covariance Estimation (MrRCE), encourages structured similarity in parameters, in which coefficients for the same variable in related tasks share the same sign and a similar magnitude. We illustrate the benefits of our approach in synthetic and real examples, and show that the proposed method outperforms natural competitors and alternative estimators under several model settings.

Keywords: Covariance selection $\cdot$ EM algorithm $\cdot$ Multivariate regression $\cdot$ Penalized likelihood $\cdot$ Regularization methods $\cdot$ Sparse precision matrix

1 Introduction

In many cases, a common set of predictor variables is used for predicting different but related target variables. For example, an on-demand transportation company may attempt forecasting demand and supply in different time frames and geographic locations; a real-estate firm may be interested in predicting both the construction costs and the sale prices of residential apartments, given a set of project’s physical and financial covariates, and external economic variables.

The general task of modeling multiple responses using a joint set of covariates can be expressed using multivariate regression (MR), or multiple response regression — a generalization of the classical regression model to regressing $q>1$ responses on $p$ predictors. In the MR settings, one is presented with $n$ independent observations, $\left\{\left(X_{i},Y_{i}\right)\right\}_{i=1}^{n}$ , where $X_{i}\in\mathbb{R}^{p}$ and $Y_{i}\in\mathbb{R}^{q}$ contain the predictors and responses for the $i$ th sample, respectively. Let $X=\left(X_{1},...,X_{n}\right)^{T}=\left(\mathbf{x}_{1},...,\mathbf{x}_{p}\right)\in\mathbb{R}^{n\times p}$ denote the predictor matrix and $Y=\left(Y_{1},...,Y_{n}\right)^{T}=\left(\mathbf{y}_{1},...,\mathbf{y}_{q}\right)\in\mathbb{R}^{n\times q}$ denote the response matrix. For simplicity of notation, assume that the columns of $X$ and $Y$ have been centered so that we need not consider an intercept term. We further assume that the i.i.d $N_{q}\left(0,\Sigma\right)$ error terms are collected into an $n\times q$ error matrix $E$ , where $\Sigma$ is the among-tasks covariance matrix. The multivariate regression model is given by,

$$Y=XB+E, \tag{1}$$

where $B$ is a $p\times q$ regression coefficient matrix. The random matrices in (1) are assumed to follow a matrix-variate normal distribution [9, 15], $E\sim MVN_{n\times q}\left(0,I_{n},\Sigma\right)$ and $Y\sim MVN_{n\times q}\left(XB,I_{n},\Sigma\right)$ . For reasons that will later become clear, when considering the noise structure of the MR model, the precision matrix, $\Omega=\Sigma^{-1}$ , is commonly the preferred object.

Straightforward prediction and estimation with the MR model can become quite challenging when the number of predictors and responses is large relative to $n$ , as it requires one to estimate $pq$ parameters. The univariate regression model ( $q=1$ ) has been widely studied, and numerous methods have been developed for variable selection (support recovery) and coefficients estimation. A naive approach to the MR problem is to apply one of these methods to each of the $q$ tasks independently. However, in many cases, the different problems are related, and this oversimplified approach fails to utilize all the information contained in the data (see, e.g., [3, 31]). For a review of Bayesian approaches for estimation and prediction with the MR model see [11] and references therein.

In the MR literature, many approaches seek to reduce the number of parameters to be estimated through a penalized (or constrained) least squares framework. [5] generalized the classical Reduced-Rank Regression (RRR) [1, 21, 38] to high dimensional settings, estimating a low-rank coefficient matrix by penalizing the rank of $B$. [43] proposed a method called Factor Estimation and Selection (FES), in which an $L_{1}$-penalty is applied to the singular values of $B$. FES induces sparsity in the singular values of $B$, conducting dimension reduction and coefficient estimation simultaneously. One major drawback of dimension reduction techniques is that the resulting model is often hard to interpret in terms of the original data, since the set of predictors is reduced to a few important principal factors.

The multivariate regression framework naturally induces a group structure over the coefficient matrix, $B$, in which every explanatory variable, $\mathbf{x}_{i}\text{ for }i=1,...,p$, corresponds to a group of $q$ coefficients, $B_{i}=\left(\beta_{i1},...,\beta_{iq}\right)$ (see Figure 1). While many approaches make no assumption over the group structure, others utilize it for learning structured sparsity. In the multi-task learning literature, the $L_{1}/L_{2}$-penalty, also known as the group lasso penalty [44], has been applied with the rows of $B$ as groups. The $L_{1}/L_{2}$-penalty can be viewed as an intermediate between the $L_{1}$-penalty used in lasso regression [35] and the $L_{2}$-penalty used in ridge regression [19], aimed at utilizing the relatedness among tasks for identifying the joint support, i.e., the set of predictors with non-zero coefficients across all $q$ responses [28]. [29] proposed a mixed constraint function, by applying both the lasso and the group lasso penalties to the elements and rows of $B$, respectively. This approach produces element-wise as well as row-wise sparsity in the coefficient matrix. [36] studied a different constraint function, placing an $L_{\infty}$-penalty over the rows of $B$. As noted by the authors, this method is only suitable for variable selection and not for estimation. Extensions of mixed norm penalties to overlapping groups have been proposed in order to handle more general and complex group structures (see, e.g., [23, 27]). These methods produce highly interpretable models; however, they are limited to the case $\Omega\propto I_{q}$ and do not account for correlated errors. [31, 7, 39] have recently shown that accounting for this additional information in MR problems can be beneficial for both coefficient estimation and prediction.

In multivariate normal theory, the entries of $\Omega$ that equal zero correspond to pairs of variables that are conditionally independent, given all of the other variables in the data. The problem of sparse precision matrix estimation has drawn considerable recent attention, and several methods have been proposed for both support recovery and parameter estimation. Perhaps the most widely used approach is the graphical lasso [12], in which simultaneous sparsity structure identification and coefficients estimation are achieved by minimizing the $L_{1}$ -regularized negative log-likelihood function of $\Omega$ [45, 8, 30]. Recently, sparse precision matrix estimation has also been considered in regression frameworks, in which the main goal for this explicit estimation is to improve prediction [40, 31].

[31] proposed Multivariate Regression with Covariance Estimation (MRCE), a method for sparse multivariate regression that directly accounts for correlated errors. MRCE minimizes the negative log-likelihood function with an $L_{1}$ -penalty for both $B$ and $\Omega$ ,

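$$\underset{B,\Omega}{\arg\min}\;\text{tr}\left[\frac{1}{n}\Omega\left(Y-XB\right)^{T}\left(Y-XB\right)\right]-\log\left|\Omega\right|+\lambda_{1}\left\|B\right\|_{1}+\lambda_{2}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|, \tag{2}$$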

where $\text{tr}\left(\cdot\right)$ denotes the trace, $\lambda_{1}$ and $\lambda_{2}$ are the regularization parameters and $\omega_{jj^{\prime}}$ is the $\left(j,j^{\prime}\right)$ element of $\Omega$ . [26] extended the approach of [31] to allow for weighted $L_{1}$ -penalties over the elements of $B$ and $\Omega$ . [42] considered a similar objective to the one in (2), and proposed an algorithm for the sparse estimation of the coefficient and inverse covariance matrices. However, unlike [31], their method aimed at improving the estimation of $\Omega$ , rather than $B$ . Our work further leverages correlations between the different problems to improve the accuracy of the estimators and predictions, by not only accounting for the correlation between the error terms but the similarities between the coefficients as well.

While MRCE accounts for correlated responses through the precision matrix $\Omega$ , it does not learn structured sparsity in $B$ , essentially selecting relevant covariates for each response separately. In a recent work, [39] proposed an algorithm for the multivariate group lasso with covariance estimation, replacing the lasso penalty in (2) with an $L_{1}/L_{2}$ -penalty over a pre-specified group structure. [7] developed a method within the reduced-rank regression framework that simultaneously performs variable selection and sparse precision matrix estimation. These methods for learning group sparsity assume that the sparsity structure is known a-priori. Instead, [33] proposed an approach for group sparse multivariate regression that can jointly learn both the response structure and regression coefficients with structured sparsity.

All of the above methods that consider a group structure over the coefficient matrix essentially assume that the within-group similarities arise solely through a joint sparsity structure. In many applications, these structured (and unstructured) sparsity assumptions are not suitable, for instance, if one expects many covariates of small or medium effect. Furthermore, these sparse estimators encourage within-group coefficients to be of similar absolute magnitude, and do not favor same-sign coefficients. However, in various real-life examples it is more natural to encourage coefficients within the same group to also share a sign. To address these issues, we construct an estimator for the multivariate regression by directly modeling and capturing the within-group similarities, while also accounting for the error covariance structure. Our method, titled Multivariate random Regression with Covariance Estimation (MrRCE), involves a multivariate linear mixed model with an underlying group structure over the coefficient matrix, designed to encourage related coefficients to share a common sign and similar magnitude.

Multivariate Linear Mixed Models (mvLMMs) [17] are MR models that relate a joint set of covariates to multiple correlated responses. mvLMMs are applied in many real-life problems and frequently used in genetics due to their ability to account for relatedness among observations (see, e.g., [25, 22, 24, 37]). The mvLMMs model can be viewed as a generalization of MR (similar to the way Linear Mixed Models (LMMs) are a generalization of linear regression models), allowing both fixed and random effects. Consider the MR problem (1), but with an additional term for the set of random predictors, collected into the matrix $Z=\left(Z_{1},...,Z_{n}\right)^{T}=\left(\mathbf{z}_{1},...,\mathbf{z}_{r}\right)\in\mathbb{R}^{n\times r}$ . The mvLMM model is given by,

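$$Y=XB+Z\Gamma+E,\qquad E\sim MVN_{n\times q}\left(0,I_{n},\Sigma\right),\quad\Gamma\sim MVN_{r\times q}\left(0,R,G\right), \tag{3}$$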

where $B$ is a $p\times q$ fixed effect coefficient matrix and $\Gamma$ is an $r\times q$ random effect coefficient matrix. Here, $R$ and $G$ are the common covariance matrices of columns and rows of $\Gamma$ , respectively.

In this paper we consider the problem of estimation and prediction under the multivariate random effect regression, an mvLMM strictly involving random effects,

$$Y=Z\Gamma+E. \tag{4}$$

Under the proposed formulation, and unlike the standard mvLMM framework, we are interested not only in estimating the covariance components but also in predicting the random component $\Gamma$. Our method accounts for correlations between responses and similarities among coefficients, captured by estimating a joint equicorrelation covariance matrix for the rows of $\Gamma$ (see Eq. 5 for details). Hence, the MrRCE method is an example of what one could call structured similarity learning, in which the different coefficient groups are assumed to be independent, whereas a within-group similarity is encouraged. This covariance structure for the random coefficient matrix reduces the MR problem of estimating $pq$ parameters to the problem of estimating two covariance components: the coefficients’ common variance and the intra-group correlation coefficient, or similarity level. The estimation of the covariance structure is achieved through a penalized likelihood, adding an $L_{1}$-penalty over the off-diagonal entries of $\Omega=\Sigma^{-1}$.

The remainder of the paper is structured as follows. Section 2 describes the MrRCE method and corresponding Expectation-Maximization (EM) based computational algorithm. Section 3 establishes a connection between the proposed method and the multivariate Ridge estimator. Simulation studies are performed in Section 4 to compare our method with competing estimators, and Section 5 contains two real data applications of MrRCE. Section 6 concludes with a brief discussion.

2 The MrRCE Method

Consider the random effect regression model (4) with $r=p$ . Assume both the error matrix $E$ and the coefficient matrix $\Gamma$ follow a matrix variate normal distribution,

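$$E\sim MVN_{n\times q}\left(0,I_{n},\Sigma\right),\qquad\Gamma\sim MVN_{p\times q}\left(0,I_{p},\sigma^{2}C\right).$$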

Further assume an equicorrelation structure for the matrix $C$ , controlled by the unknown intra-group correlation coefficient $\rho\in[0,1)$ ,

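$$C=C_{\rho}=\begin{pmatrix}1&\rho&\cdots&\rho\\ \rho&1&\ddots&\vdots\\ \vdots&\ddots&\ddots&\rho\\ \rho&\cdots&\rho&1\end{pmatrix}. \tag{5}$$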

The unknown parameter $\rho$ can be thought of as a relative measure of the *within-group similarity* [6]. Large values of $\rho$ correspond to high similarity among members of the same group, leading to coefficients of similar magnitude and the same sign, whereas $\rho=0$ corresponds to i.i.d. draws for the entries of the coefficient matrix $\Gamma$. We refer to the random variable $\Gamma$ as unobserved data, and to $\left(Y,\Gamma\right)$ as the full data. Denoting the likelihood function of the full data by $\mathcal{L}\left(\cdot\right)$ and the collection of parameters by $\Theta=\left\{\Omega,\sigma^{2},\rho\right\}$, we have,

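$$\mathcal{L}\left(Y,\Gamma;\Theta\right)=\mathcal{L}_{Y\mid\Gamma}\left(Y\mid\Gamma;\Omega\right)\mathcal{L}_{\Gamma}\left(\Gamma\mid\sigma^{2},\rho\right).$$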

Thus, the negative log-likelihood function of the complete data is given by (up to a constant),

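$$\ell\left(Y,\Gamma;\Theta\right)=\text{tr}\left[\frac{1}{n}\Omega\left(Y-Z\Gamma\right)^{T}\left(Y-Z\Gamma\right)\right]-\log\left|\Omega\right|+\text{tr}\left[\frac{1}{p}\Delta\Gamma^{T}\Gamma\right]-\log\left|\Delta\right|,$$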

where $\Delta^{-1}=\sigma^{2}C$ . We construct an estimator of $\Theta$ using a penalized normal log-likelihood, adding an $L_{1}$ -penalty over the off-diagonal entries of $\Omega$ ,

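$$\hat{\Theta}=\underset{\Theta}{\arg\min}\;\ell\left(Y,\Gamma;\Theta\right)+\lambda_{\omega}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|, \tag{6}$$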

where $\lambda_{\omega}>0$ is a regularization parameter.
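To make the model of this section concrete, the following minimal NumPy sketch builds the equicorrelation matrix $C_{\rho}$, draws a coefficient matrix whose rows are $N_{q}\left(0,\sigma^{2}C_{\rho}\right)$, and evaluates the penalized complete-data objective for given parameter values. The function names `equicorrelation` and `penalized_objective`, as well as the toy dimensions, are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np

def equicorrelation(q, rho):
    """C_rho: ones on the diagonal, rho everywhere else."""
    return (1.0 - rho) * np.eye(q) + rho * np.ones((q, q))

def penalized_objective(Y, Z, Gamma, Omega, sigma2, rho, lam_omega):
    """Complete-data negative log-likelihood (up to constants) plus an
    L1 penalty on the off-diagonal entries of Omega."""
    n, q = Y.shape
    p = Z.shape[1]
    Delta = np.linalg.inv(sigma2 * equicorrelation(q, rho))  # Delta^{-1} = sigma^2 C
    R = Y - Z @ Gamma
    nll = (np.trace(Omega @ R.T @ R) / n - np.linalg.slogdet(Omega)[1]
           + np.trace(Delta @ Gamma.T @ Gamma) / p - np.linalg.slogdet(Delta)[1])
    off_diag = np.abs(Omega).sum() - np.abs(np.diag(Omega)).sum()
    return nll + lam_omega * off_diag

# Toy data (hypothetical sizes): each row of Gamma holds the q related
# coefficients of one predictor, correlated through C_rho.
rng = np.random.default_rng(0)
n, p, q, sigma2, rho = 40, 6, 4, 1.0, 0.8
C = equicorrelation(q, rho)
Gamma = rng.multivariate_normal(np.zeros(q), sigma2 * C, size=p)
Z = rng.normal(size=(n, p))
Y = Z @ Gamma + rng.normal(size=(n, q))
obj = penalized_objective(Y, Z, Gamma, np.eye(q), sigma2, rho, lam_omega=0.1)
```

With a large `rho`, the entries within each row of `Gamma` tend to share a sign, which is the behaviour the equicorrelation prior is meant to encourage.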

2.1 The Algorithm

We propose an iterative, EM-based [10] algorithm for solving (6). Alg. 1 provides a schematic overview of the MrRCE algorithm.

Using eigendecomposition (similar to [46, 13]), we write,

$$C=UDU^{T}\quad\text{and}\quad ZZ^{T}=LSL^{T}, \tag{7}$$

where $S$ and $D:=D_{\rho}=diag\left(d_{1}\left(\rho\right),...,d_{q}\left(\rho\right)\right)$ are diagonal matrices, and $U$ is independent of $\rho$. We then multiply (4) by the orthogonal matrices $U$ and $L^{T}$ from the right and left, respectively, to obtain,

$$\tilde{Y}=\tilde{Z}\tilde{\Gamma}+\tilde{E},$$

where $\tilde{Y}=L^{T}YU$ , $\tilde{Z}=L^{T}Z$ , and,

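$$\tilde{\Gamma}=\Gamma U\sim MVN_{p\times q}\left(0,I_{p},\sigma^{2}D_{\rho}\right),\qquad\tilde{E}=L^{T}EU\sim MVN_{n\times q}\left(0,I_{n},U^{T}\Sigma U\right).$$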

We drop the $\tilde{\cdot}$ notation and assume (with a slight abuse of notation) that the original data is of the form,

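$$Y=Z\Gamma+E,\qquad E\sim MVN_{n\times q}\left(0,I_{n},\Sigma:=\Omega^{-1}\right),\quad\Gamma\sim MVN_{p\times q}\left(0,I_{p},\sigma^{2}D_{\rho}\right), \tag{8}$$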

namely,

$$Y\sim MVN_{n\times q}\left(0,S,\sigma^{2}D_{\rho}\right)+MVN_{n\times q}\left(0,I_{n},\Sigma\right).$$

Next, we describe an EM-based algorithm for solving (6) under the assumptions (8).
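The rotation step above can be checked numerically. The sketch below, which assumes NumPy and uses hypothetical toy dimensions, verifies that an equicorrelation matrix has one eigenvalue $1+\left(q-1\right)\rho$ and $q-1$ eigenvalues $1-\rho$, that a single orthogonal $U$ diagonalizes $C_{\rho}$ for every $\rho$, and that the rotated design satisfies $\tilde{Z}\tilde{Z}^{T}=S$. The particular construction of `U` through a QR factorization is one convenient choice, not necessarily the one used by the authors.

```python
import numpy as np

def equicorrelation(q, rho):
    return (1.0 - rho) * np.eye(q) + rho * np.ones((q, q))

q = 5
eigvals = {}
for rho in (0.2, 0.8):
    # Sorted eigenvalues: q-1 copies of 1-rho, then one copy of 1+(q-1)rho.
    eigvals[rho] = np.sort(np.linalg.eigvalsh(equicorrelation(q, rho)))

# Any orthogonal U whose first column is proportional to the vector of ones
# diagonalizes C_rho for every rho, so U does not depend on rho.
U, _ = np.linalg.qr(np.column_stack([np.ones(q), np.eye(q)[:, 1:]]))
D_02 = U.T @ equicorrelation(q, 0.2) @ U
D_08 = U.T @ equicorrelation(q, 0.8) @ U

# Rotate toy data as in the text: Y~ = L^T Y U and Z~ = L^T Z.
rng = np.random.default_rng(1)
n, p = 8, 3
Z = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
S, L = np.linalg.eigh(Z @ Z.T)      # Z Z^T = L diag(S) L^T
Y_t, Z_t = L.T @ Y @ U, L.T @ Z
```

After the rotation both covariance factors of $Z\Gamma$ are diagonal, which is what keeps the later E- and M-steps cheap.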

E-step. Let $\Theta_{t-1}$ denote the estimate of $\Theta$ at iteration $t-1$. At step $t$, we wish to evaluate the following expressions,

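$$Q_{t}^{1}=\mathbb{E}\left[\left(Y-Z\Gamma\right)^{T}\left(Y-Z\Gamma\right)\mid Y,\Theta_{t-1}\right], \tag{9}$$

$$Q_{t}^{2}=\mathbb{E}\left[\Gamma^{T}\Gamma\mid Y,\Theta_{t-1}\right]. \tag{10}$$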

We let $\otimes$ denote the Kronecker product and $\text{vec}\left(\cdot\right)$ the vectorization operator, i.e., the concatenation of the columns of a $k\times l$ matrix into a $kl$-dimensional vector. For a matrix $A\in\mathbb{R}^{k\times p}$, we let $A\Gamma:=G=\begin{pmatrix}\mathbf{g}_{1}&\cdots&\mathbf{g}_{q}\end{pmatrix}$, with $\mathbf{g}_{j}$ the $j$th column of $G$. The joint distribution of $\mathbf{g}=\text{vec}\left(G\right)$ and $\mathbf{y}=\text{vec}\left(Y\right)$ is given by,

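$$\begin{pmatrix}\mathbf{g}\\ \mathbf{y}\end{pmatrix}\sim N\left(\mathbf{0},\begin{bmatrix}\Delta^{-1}\otimes AA^{T}&\Delta^{-1}\otimes AZ^{T}\\ \Delta^{-1}\otimes ZA^{T}&\Sigma\otimes I_{n}+\Delta^{-1}\otimes ZZ^{T}\end{bmatrix}:=\begin{bmatrix}\Sigma_{11}&\Sigma_{12}\\ \Sigma_{21}&\Sigma_{22}\end{bmatrix}\right),$$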

hence, the conditional distribution of $\mathbf{g}\mid\mathbf{y}$ is given by,

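$$\mathbf{g}\mid\mathbf{y}\sim N\left(\Sigma_{12}\Sigma_{22}^{-1}\mathbf{y},\;\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right). \tag{11}$$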

In order to evaluate (9) and (10), we calculate $\mathbb{E}\left[\Gamma\mid Y,\Theta_{t-1}\right]$ and $\mathbb{E}\left[\Gamma^{T}A^{T}A\Gamma\mid Y,\Theta_{t-1}\right]$ for $A=I_{p},Z$ . The former is the Empirical-Best Linear Unbiased Predictor (E-BLUP) [16, 17] (see Predicting $\Gamma$ below), whereas the latter can be easily obtained from (11) since,

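$$\mathbb{E}\left[G^{T}G\mid Y,\Theta_{t-1}\right]_{i,j}=\mathbb{E}\left[\mathbf{g}_{i}^{T}\mathbf{g}_{j}\mid\mathbf{y},\Theta_{t-1}\right].$$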
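The E-step quantities can be sketched directly from the Gaussian conditioning formula. The code below assumes NumPy and forms the Kronecker-structured covariance blocks explicitly, which is only practical for small problems; it relies on the identity $\mathbb{E}\left[\mathbf{g}_{i}^{T}\mathbf{g}_{j}\mid\mathbf{y}\right]=\mathbb{E}\left[\mathbf{g}_{i}\mid\mathbf{y}\right]^{T}\mathbb{E}\left[\mathbf{g}_{j}\mid\mathbf{y}\right]+\text{tr}\left(\text{Cov}\left(\mathbf{g}_{i},\mathbf{g}_{j}\mid\mathbf{y}\right)\right)$. The function names and toy inputs are assumptions for illustration; an efficient implementation would exploit the diagonal structure obtained after the rotation.

```python
import numpy as np

def conditional_moments(Y, Z, A, Sigma, Delta_inv):
    """Mean and covariance of g = vec(A Gamma) given y = vec(Y)."""
    n = Y.shape[0]
    y = Y.reshape(-1, order="F")                    # column-stacking vec
    S11 = np.kron(Delta_inv, A @ A.T)
    S12 = np.kron(Delta_inv, A @ Z.T)
    S22 = np.kron(Sigma, np.eye(n)) + np.kron(Delta_inv, Z @ Z.T)
    W = np.linalg.solve(S22, S12.T).T               # S12 S22^{-1} (S22 is symmetric)
    return W @ y, S11 - W @ S12.T

def expected_gram(Y, Z, A, Sigma, Delta_inv):
    """E[(A Gamma)^T (A Gamma) | Y], a q x q matrix."""
    k, q = A.shape[0], Y.shape[1]
    mean, cov = conditional_moments(Y, Z, A, Sigma, Delta_inv)
    M = mean.reshape(k, q, order="F")
    out = M.T @ M
    for i in range(q):
        for j in range(q):
            # Add tr Cov(g_i, g_j | y), read from the (i, j) block of cov.
            out[i, j] += np.trace(cov[i * k:(i + 1) * k, j * k:(j + 1) * k])
    return out

rng = np.random.default_rng(0)
n, p, q = 10, 3, 2
Z = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
Sigma = np.eye(q)
Delta_inv = np.diag([1.6, 0.4])                     # sigma^2 D_rho with sigma^2 = 1, rho = 0.6

Q2 = expected_gram(Y, Z, np.eye(p), Sigma, Delta_inv)             # E[Gamma^T Gamma | Y]
G_mean = conditional_moments(Y, Z, Z, Sigma, Delta_inv)[0].reshape(n, q, order="F")
Q1 = Y.T @ Y - Y.T @ G_mean - G_mean.T @ Y + expected_gram(Y, Z, Z, Sigma, Delta_inv)
```

Here `Q1` and `Q2` play the roles of the two expectations required by the M-step, computed with $A=Z$ and $A=I_{p}$ respectively.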

M-step. The minimization of the objective over $\Theta$ can be split into two disjoint minimization problems:

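$$\underset{\Omega\succeq 0}{\arg\min}\;\text{tr}\left[\frac{1}{n}\Omega Q_{t}^{1}\right]-\log\left|\Omega\right|+\lambda_{\omega}\sum_{j\neq j^{\prime}}\left|\omega_{jj^{\prime}}\right|, \tag{12}$$

$$\underset{\sigma>0,\;\rho\in[0,1)}{\arg\min}\;\text{tr}\left[\frac{1}{p}\Delta Q_{t}^{2}\right]-\log\left|\Delta\right|. \tag{13}$$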

The first minimization problem is exactly the $L_{1}$ -penalized precision matrix estimation problem considered by [45, 8, 12, 31, 20], among others. We solve (12) by applying the graphical lasso algorithm of [12]. The second minimization problem, (13), can be easily solved in closed-form by utilizing the diagonal form of $\Delta$ .
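The text does not spell out the closed form for the second problem, so the following NumPy sketch shows one way it could be derived. It assumes that, after the rotation, $\Delta^{-1}=\sigma^{2}D_{\rho}$ with a single eigenvalue $1+\left(q-1\right)\rho$ (placed at index `large_idx`, an assumption about ordering) and $q-1$ eigenvalues $1-\rho$. Writing $v_{1}=\sigma^{2}\left(1+\left(q-1\right)\rho\right)$ and $v_{2}=\sigma^{2}\left(1-\rho\right)$, the objective separates in $v_{1},v_{2}$, and the constraint $\rho\geq 0$ is handled by clipping. The function names are hypothetical.

```python
import numpy as np

def m_step_sigma2_rho(Q2, p, large_idx=0):
    """Minimize tr[(1/p) Delta Q2] - log|Delta| over sigma2 > 0, rho in [0, 1),
    assuming Delta^{-1} = sigma2 * diag(d_1, ..., d_q) with one entry
    1 + (q-1) rho (at large_idx) and q-1 entries 1 - rho."""
    diag = np.diag(Q2) / p
    q = diag.size
    v1 = diag[large_idx]                      # stationary value of sigma2 * (1 + (q-1) rho)
    v2 = (diag.sum() - v1) / (q - 1)          # stationary value of sigma2 * (1 - rho)
    sigma2 = (v1 + (q - 1) * v2) / q
    rho = max(0.0, 1.0 - v2 / sigma2)         # clip to the boundary rho = 0 if needed
    return sigma2, rho

def m_step_objective(Q2, p, sigma2, rho, large_idx=0):
    """Value of tr[(1/p) Delta Q2] - log|Delta| under the same assumptions."""
    q = Q2.shape[0]
    d = np.full(q, 1.0 - rho)
    d[large_idx] = 1.0 + (q - 1) * rho
    v = sigma2 * d
    return np.sum(np.diag(Q2) / p / v) + np.sum(np.log(v))

# Toy check: a Q2 whose diagonal matches sigma2 = 2 and rho = 0.5 with q = 4.
Q2_toy = 10 * np.diag([5.0, 1.0, 1.0, 1.0])
sigma2_hat, rho_hat = m_step_sigma2_rho(Q2_toy, p=10)
```

Only the diagonal of $Q_{t}^{2}$ enters the update because $\Delta$ is diagonal in the rotated coordinates.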

Predicting $\Gamma$. Given $\hat{\Theta}$, our estimate of $\Theta$, we compute the E-BLUP [16, 17] for $\boldsymbol{\gamma}=\text{vec}\left(\Gamma\right)$. Denoting $\tilde{Z}=I_{q}\otimes Z$, $L=\hat{\sigma}^{2}\hat{D}_{\rho}\otimes I_{p}$ and $R=\hat{\Omega}^{-1}\otimes I_{n}$, the E-BLUP $\boldsymbol{\gamma}^{*}$ for $\boldsymbol{\gamma}$ is given by,

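$$\boldsymbol{\gamma}^{*}=\left(\tilde{Z}^{T}R^{-1}\tilde{Z}+L^{-1}\right)^{-1}\tilde{Z}^{T}R^{-1}\mathbf{y}.$$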

Alternatively, as proved by [18], $\boldsymbol{\gamma}^{*}=L^{T}\tilde{Z}^{T}\Psi^{-1}\mathbf{y}$ where, $\Psi=\tilde{Z}L\tilde{Z}^{T}+R$ . In order to predict $\Gamma$ , we simply compute $\Gamma^{*}=\text{unvec}\left(\boldsymbol{\gamma}^{*}\right)$ , where $\text{unvec}\left(\cdot\right)$ represents the reversal of the $\text{vec}\left(\cdot\right)$ operation.
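The two expressions for the E-BLUP can be compared numerically. The sketch below assumes NumPy; the values standing in for $\hat{\sigma}^{2}$, $\hat{D}_{\rho}$ and $\hat{\Omega}$ are made-up placeholders rather than output of the MrRCE algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 12, 4, 3
Z = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))

sigma2_hat = 1.5                              # placeholder estimate
D_hat = np.diag([2.2, 0.4, 0.4])              # D_rho for q = 3, rho = 0.6
A = rng.normal(size=(q, q))
Omega_hat = A @ A.T + q * np.eye(q)           # placeholder positive definite precision

Zt = np.kron(np.eye(q), Z)                    # Z~ = I_q (x) Z
L = np.kron(sigma2_hat * D_hat, np.eye(p))    # covariance of vec(Gamma)
R = np.kron(np.linalg.inv(Omega_hat), np.eye(n))
y = Y.reshape(-1, order="F")                  # vec(Y)

# Form 1: (Z~^T R^{-1} Z~ + L^{-1})^{-1} Z~^T R^{-1} y
R_inv = np.linalg.inv(R)
gamma_a = np.linalg.solve(Zt.T @ R_inv @ Zt + np.linalg.inv(L), Zt.T @ R_inv @ y)

# Form 2: L Z~^T Psi^{-1} y with Psi = Z~ L Z~^T + R
Psi = Zt @ L @ Zt.T + R
gamma_b = L @ Zt.T @ np.linalg.solve(Psi, y)

Gamma_star = gamma_a.reshape(p, q, order="F")  # unvec
```

The first form inverts a $pq\times pq$ matrix and the second an $nq\times nq$ matrix, so the cheaper one depends on whether $p$ or $n$ is smaller.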

Starting value and Stopping Criteria. We initialize $\Omega_{0}=I_{q}$ , $\Delta^{-1}=I_{q}$ , and consider two alternatives for the MrRCE algorithm’s stopping criteria.

1. Set a tolerance value, $\tau>0$. Iterate until the sum of absolute changes in the values of $\Theta$ in two successive iterations is smaller than the tolerance value.

2. Set a tolerance value, $\tau>0$, and let $l_{t}$ denote the log-likelihood at iteration $t$. Iterate until the relative change in the log-likelihood value, $\left|\frac{l_{t-1}-l_{t}}{l_{t-1}}\right|$, is smaller than $\tau$.
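The two stopping rules can be written as small helper functions. This is a minimal sketch assuming NumPy; the representation of $\Theta$ as a tuple `(Omega, sigma2, rho)` and the function names are assumptions made for illustration.

```python
import numpy as np

def params_converged(theta_prev, theta_curr, tol):
    """Criterion 1: the sum of absolute changes in all entries of Theta is below tol."""
    change = sum(np.abs(np.asarray(a) - np.asarray(b)).sum()
                 for a, b in zip(theta_prev, theta_curr))
    return change < tol

def loglik_converged(l_prev, l_curr, tol):
    """Criterion 2: the relative change in the log-likelihood is below tol."""
    return abs((l_prev - l_curr) / l_prev) < tol

theta_old = (np.eye(2), 1.0, 0.5)
theta_new = (np.eye(2), 1.0, 0.5 + 1e-8)
stop_by_params = params_converged(theta_old, theta_new, tol=1e-6)
stop_by_loglik = loglik_converged(-100.0, -100.5, tol=1e-2)
```

The second rule is undefined when the previous log-likelihood is exactly zero, so an implementation may want to guard against that case.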

Convergence. The MrRCE algorithm is a variant of the EM algorithm for penalized likelihood, hence each step ensures a decrease in the objective, and the algorithm’s convergence is guaranteed (see e.g. [14]).

3 Connection to Ridge Regression

We present a connection between the MrRCE method and the Ridge Regression (RR) estimator [19]. More specifically, we explore a special case in which the BLUP for $\Gamma$ derived by the MrRCE algorithm is equivalent to the multivariate RR estimator [4].

Consider the model,

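$$\mathbf{y}=\tilde{Z}\boldsymbol{\gamma}+\boldsymbol{\epsilon},\qquad\boldsymbol{\epsilon}\sim N\left(0,\Sigma_{0}\otimes I_{n}:=\Sigma\right),\quad\boldsymbol{\gamma}\sim N\left(0,\Lambda_{0}\otimes I_{p}:=\Lambda\right).$$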

The joint distribution of $\left(\textbf{y},\boldsymbol{\gamma}\right)$ is given by,

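$$\begin{pmatrix}\boldsymbol{\gamma}\\ \mathbf{y}\end{pmatrix}\sim N\left(\mathbf{0},\begin{bmatrix}\Lambda&\Lambda\tilde{Z}^{T}\\ \tilde{Z}\Lambda&\tilde{Z}\Lambda\tilde{Z}^{T}+\Sigma\end{bmatrix}\right),$$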

and the BLUP for the random coefficient vector is the expectation of $\boldsymbol{\gamma}$ conditional on $\mathbf{y}$ ,

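$$\hat{\boldsymbol{\gamma}}_{BLUP}=\mathbb{E}\left[\boldsymbol{\gamma}\mid\mathbf{y}\right]=\Lambda\tilde{Z}^{T}\left(\tilde{Z}\Lambda\tilde{Z}^{T}+\Sigma\right)^{-1}\mathbf{y}.$$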

The RR estimator can be extended to the multivariate case as in [4],

$$\hat{\boldsymbol{\gamma}}_{RR}=\left(\tilde{Z}^{T}\tilde{Z}+K\right)^{-1}\tilde{Z}^{T}\mathbf{y},$$

where $K\succ 0$ is the $pq\times pq$ ridge matrix. We apply the generalized Sherman-Morrison-Woodbury [32, 41] formula to the inverse of $\tilde{Z}^{T}\tilde{Z}+K$ , to obtain,

[TABLE]

Eq. 14 can be simplified as follows,

[TABLE]

Thus, under the i.i.d. error model, i.e., $\Sigma_{0}=\sigma_{\epsilon}^{2}I_{q}$, setting $K=\left(\Sigma_{0}\otimes I_{p}\right)\Lambda^{-1}$ yields,

[TABLE]

This is a well known connection between the RR estimator and BLUP which proves the following result:

Proposition 1.

Assuming $\hat{\Sigma}_{0}\propto I$ , the prediction for $\Gamma$ obtained by the MrRCE algorithm is equivalent to the multivariate RR estimator with Ridge matrix $K=\left(\hat{\Sigma}_{0}\otimes I_{p}\right)\hat{\Lambda}^{-1}$ .
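The equivalence stated in Proposition 1 follows from the identity $\left(\tilde{Z}^{T}\tilde{Z}+\sigma_{\epsilon}^{2}\Lambda^{-1}\right)^{-1}\tilde{Z}^{T}=\Lambda\tilde{Z}^{T}\left(\tilde{Z}\Lambda\tilde{Z}^{T}+\sigma_{\epsilon}^{2}I\right)^{-1}$ and can be checked numerically. The sketch below assumes NumPy and uses made-up values for $\sigma_{\epsilon}^{2}$ and $\Lambda_{0}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 15, 3, 2
Z = rng.normal(size=(n, p))
y = rng.normal(size=n * q)

sigma_eps2 = 0.7                                   # made-up error variance
Lambda0 = np.array([[1.0, 0.6], [0.6, 1.0]])       # made-up q x q coefficient covariance

Zt = np.kron(np.eye(q), Z)                         # Z~ = I_q (x) Z
Sigma = sigma_eps2 * np.eye(n * q)                 # Sigma_0 (x) I_n with Sigma_0 = sigma_eps2 * I_q
Lam = np.kron(Lambda0, np.eye(p))                  # Lambda_0 (x) I_p

# BLUP: Lambda Z~^T (Z~ Lambda Z~^T + Sigma)^{-1} y
blup = Lam @ Zt.T @ np.linalg.solve(Zt @ Lam @ Zt.T + Sigma, y)

# Multivariate ridge with K = (Sigma_0 (x) I_p) Lambda^{-1}
K = sigma_eps2 * np.linalg.inv(Lam)
ridge = np.linalg.solve(Zt.T @ Zt + K, Zt.T @ y)
```

With a non-spherical $\Sigma_{0}$ the two estimators generally differ, which is why the proposition is restricted to $\hat{\Sigma}_{0}\propto I$.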

To better understand this result, consider the case $\Sigma_{0}=\sigma_{\epsilon}^{2}I_{q}$ and $\Lambda_{0}=\sigma_{\gamma}^{2}C$ , where $C=C_{\rho}$ is an equicorrelation matrix with parameter $\rho$ . Let $K=\left(\Sigma_{0}\otimes I_{p}\right)\Lambda^{-1}=\eta C^{-1}\otimes I_{p}$ where $\eta=\left(\sigma_{\epsilon}/\sigma_{\gamma}\right)^{2}$ . It is easy to verify that $C^{-1}$ is itself an equicorrelation matrix, $C^{-1}=aI_{q}+bJ_{q}$ , where,

$$a=\frac{1}{1-\rho},\qquad b=-\frac{\rho}{\left(1-\rho\right)\left(1+\left(q-1\right)\rho\right)}.$$

For simplicity, we only examine the penalty structure for $q=2,p=1$ . Denote the coefficients vector by $\boldsymbol{\gamma}=\left(\gamma_{11},\gamma_{12}\right)^{T}$ . The ridge penalty is given by,

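$$\boldsymbol{\gamma}^{T}K\boldsymbol{\gamma}=\eta\left(a+b\right)\left(\gamma_{11}^{2}+\gamma_{12}^{2}\right)+2\eta b\,\gamma_{11}\gamma_{12}. \tag{15}$$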

Note that (15) can be reduced to the univariate ridge penalty by setting $\rho=0$, i.e., by considering i.i.d. coefficients. For $\rho>0$, the second term in (15) kicks in. We note that $b<0$ for $\rho\in\left(0,1\right)$, meaning that the second penalty term in (15) is negative for same-sign coefficients. This simple example illustrates that the MrRCE method favors equal-sign coefficients within groups.
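The sign preference can be seen in a small numerical example. The sketch below assumes NumPy and uses the closed-form inverse of an equicorrelation matrix, $C^{-1}=aI_{q}+bJ_{q}$ with $a=1/\left(1-\rho\right)$ and $b=-\rho/\left[\left(1-\rho\right)\left(1+\left(q-1\right)\rho\right)\right]$, which follows from the Sherman-Morrison formula; the values of $\rho$ and $\eta$ are arbitrary.

```python
import numpy as np

def equicorrelation_inverse_coeffs(q, rho):
    """Coefficients (a, b) such that C_rho^{-1} = a * I + b * J."""
    a = 1.0 / (1.0 - rho)
    b = -rho / ((1.0 - rho) * (1.0 + (q - 1) * rho))
    return a, b

q, rho, eta = 2, 0.6, 1.0
a, b = equicorrelation_inverse_coeffs(q, rho)
C = (1.0 - rho) * np.eye(q) + rho * np.ones((q, q))
C_inv = a * np.eye(q) + b * np.ones((q, q))

def ridge_penalty(gamma):
    """gamma^T K gamma with K = eta * C^{-1} (the p = 1 case)."""
    return eta * gamma @ C_inv @ gamma

same_sign = ridge_penalty(np.array([1.0, 1.0]))
opposite_sign = ridge_penalty(np.array([1.0, -1.0]))
```

For these values the same-sign pair is penalized by about 1.25 and the opposite-sign pair by about 5.0, although both have the same Euclidean norm.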

4 Simulation Study

In this section, we compare the performance of the MrRCE method to other multivariate regression estimators, over several settings of simulated data sets. We show that the MrRCE method significantly outperforms all competitors, in terms of Model Error, for the vast majority of simulated settings.

4.1 Estimators

We construct estimators using natural competitors of the MrRCE method, and report the results for the following methods:

1. Ordinary Least Squares (OLS): Perform $q$ separate LS regressions.

2. Group Lasso: Place an $L_{1}/L_{2}$-penalty over the rows of the coefficient matrix, with $3$-fold cross-validation (CV) for the selection of the tuning parameter.

3. Ridge Regression: The tuning parameter is selected via leave-one-out cross-validation (LOO-CV) and is shared across all tasks.

4. MRCE: The tuning parameters are selected using $5$-fold CV.

5. MrRCE: The $L_{1}$-regularization parameter (for the graphical lasso algorithm) is selected via $3$-fold CV.

4.2 Models

For each setting and every replication, we generate an $n\times p$ predictor matrix $Z$ with rows drawn independently from $N_{p}\left(0,\Sigma_{Z}\right)$, where $\left(\Sigma_{Z}\right)_{ij}=\rho_{Z}^{\left|i-j\right|}$ and $\rho_{Z}=.7$ (similar to [43, 29, 31]). Following [31], the coefficient matrix $\Gamma$ is generated as the element-wise product of three matrices: First, we sample a $p\times q$ matrix $W\sim MVN_{p\times q}\left(0,I_{p},\sigma^{2}C_{\rho}\right)$, with $C_{\rho}=I+\rho\left(J-I\right)$, where $J$ is a matrix of ones and $I$ is the identity matrix, both of dimensions $q\times q$. The values of $\rho$ range from $0$ to $0.8$, where $\rho=0$ corresponds to i.i.d. samples, $\gamma_{ij}\sim N\left(0,\sigma^{2}\right)$. Next, we set,

$$\Gamma=W\odot K\odot Q,$$

where $\odot$ denotes the element-wise product. The entries of the $p\times q$ matrix $K$ are drawn independently from $\text{Ber}\left(1-s\right)$ , and the elements in each row of the matrix $Q$ are all equal zero or one, according to $p$ independent Bernoulli draws with success probability $1-s_{g}$ . Hence, setting $s,s_{g}>0$ will induce element-wise and group sparsity in $\Gamma$ . The rows of the error matrix $E$ are drawn independently from $N_{q}\left(0,\Sigma\right)$ . We consider several structures for the error covariance matrix, specified in the form of the transformed error covariance matrix, $\tilde{\Sigma}:=U^{T}\Sigma U$ , where $U$ is the orthogonal matrix obtained via eigendecomposition over the matrix $C_{\rho}$ (see Eq. 7):

1. Independent Errors. The errors are drawn i.i.d. from $N_{q}\left(0,I_{q}\right)$.

2. Autoregressive Error Covariance ($AR\left(1\right)$). We let $\tilde{\Sigma}_{ij}=\rho_{E}^{\left|i-j\right|}$. The transformed error covariance matrix is dense, whereas the precision matrix $\tilde{\Omega}$ is a sparse, banded matrix.

3. Fractional Gaussian Noise (FGN). The transformed error covariance matrix is given by,

$$\tilde{\Sigma}_{ij}=\frac{1}{2}\left[\left(\left|i-j\right|+1\right)^{2H}-2\left|i-j\right|^{2H}+\left|\left|i-j\right|-1\right|^{2H}\right],$$

with $H=.95$. Both the transformed error covariance matrix $\tilde{\Sigma}$ and its inverse have a dense structure.

4. Equicorrelation Covariance Structure. We let $\tilde{\Sigma}_{ij}=\rho_{E}$ for $j\neq i$, and $\tilde{\Sigma}_{ij}=1$ for $j=i$. Both the transformed error covariance matrix and its inverse have a dense structure.
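The data-generating process described in this subsection can be sketched as follows, assuming NumPy. The function names and default arguments are illustrative. As a simplification, the sketch takes the error covariance `Sigma` on the original scale, whereas the text specifies the structures on the transformed scale $\tilde{\Sigma}=U^{T}\Sigma U$; mapping back would require choosing a particular $U$.

```python
import numpy as np

def ar1_cov(q, rho_E):
    """AR(1) covariance: entry (i, j) equals rho_E ** |i - j|."""
    idx = np.arange(q)
    return rho_E ** np.abs(idx[:, None] - idx[None, :])

def simulate(n=50, p=20, q=5, rho=0.4, sigma2=1.0, s=0.1, s_g=0.1,
             rho_Z=0.7, Sigma=None, seed=0):
    """Draw (Z, Gamma, Y) with Gamma = W * K * Q (element-wise products)."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(np.zeros(p), ar1_cov(p, rho_Z), size=n)
    C = (1.0 - rho) * np.eye(q) + rho * np.ones((q, q))
    W = rng.multivariate_normal(np.zeros(q), sigma2 * C, size=p)
    K = rng.binomial(1, 1.0 - s, size=(p, q))              # element-wise sparsity mask
    Q = np.repeat(rng.binomial(1, 1.0 - s_g, size=(p, 1)), q, axis=1)  # row (group) mask
    Gamma = W * K * Q
    if Sigma is None:
        Sigma = np.eye(q)                                  # independent errors
    E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)
    return Z, Gamma, Z @ Gamma + E

Z, Gamma, Y = simulate(Sigma=ar1_cov(5, 0.75))
```

Setting `s=0` and `s_g=0` gives a dense coefficient matrix, which is the regime in which the random-effects assumptions hold exactly.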

4.3 Performance Measure

For a given realization of the coefficient matrix and method $m$ , and for each replication $r$ , let $\boldsymbol{\gamma}_{j}^{\left(r\right)}$ denote the true coefficient vector and $\hat{\boldsymbol{\gamma}}_{j}^{\left(r\right)}\left(m\right)$ denote the estimated coefficient vector, both for the $j$ th response. The mean-squared estimation error is given by,

[TABLE]

where $p\left(z\right)$ and $\Sigma_{Z}$ are the density function and covariance matrix of $z$ , respectively. We evaluate the performance using the model error (ME), following the approach of [3, 43, 31],

[TABLE]

The ME over all $N$ replications is averaged to obtain our performance measure,

[TABLE]
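As a rough illustration of the performance measure, the sketch below computes a model error of the form $\text{tr}\left[\left(\hat{\Gamma}-\Gamma\right)^{T}\Sigma_{Z}\left(\hat{\Gamma}-\Gamma\right)\right]$, which sums $\left(\hat{\boldsymbol{\gamma}}_{j}-\boldsymbol{\gamma}_{j}\right)^{T}\Sigma_{Z}\left(\hat{\boldsymbol{\gamma}}_{j}-\boldsymbol{\gamma}_{j}\right)$ over the $q$ responses, and averages it over replications. This form follows the definition used in [31] and is an assumption about the exact normalization used here; the function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def model_error(Gamma_hat, Gamma, Sigma_Z):
    """tr[(Gamma_hat - Gamma)^T Sigma_Z (Gamma_hat - Gamma)]."""
    D = Gamma_hat - Gamma
    return float(np.trace(D.T @ Sigma_Z @ D))

def average_model_error(estimates, truths, Sigma_Z):
    """Average the model error over replications."""
    return float(np.mean([model_error(gh, g, Sigma_Z)
                          for gh, g in zip(estimates, truths)]))

rng = np.random.default_rng(0)
p, q = 4, 3
Gamma_true = rng.normal(size=(p, q))
Gamma_est = Gamma_true + 0.1 * rng.normal(size=(p, q))
me = model_error(Gamma_est, Gamma_true, np.eye(p))
```

With $\Sigma_{Z}=I_{p}$ the measure reduces to the squared Frobenius norm of the estimation error.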

4.4 Results

We simulate $N=200$ replications with $n=50$, $p=20$ and $q=5$, for each setting. The correlation parameter $\rho$ ranges from $0$ to $0.8$, in steps of $0.2$. Significance tests were performed using paired $t$-tests.

Independent Errors. We first consider an identity error covariance structure, $\tilde{\Sigma}=I_{q}$, and set the sparsity and group sparsity levels at $s=0.2$, $s_{g}=0$. Hence, for small values of $\rho$ we do not expect any advantage for our method over the competitors. The average ME is displayed in Figure 2. Indeed, for $\rho\in\{0,0.2\}$, our method achieves no significant improvement over Group Lasso. For $\rho>0.2$, the MrRCE method achieves a significant improvement over all competitors (all $p$-values $<10^{-2}$).

Autoregressive (AR). Let $\tilde{\Sigma}_{ij}=\rho_{E}^{\left|i-j\right|}$, with $\rho_{E}=0.75$. We use two settings for the sparsity levels, $s=s_{g}=0$ and $s=s_{g}=0.1$. Although the transformed precision matrix is a sparse, banded matrix, the assumptions of MrRCE only partially hold, as we induce sparsity in $\Gamma$ as well. The results are displayed in Figure 3. For both settings, the MrRCE method achieves the best ME performance, with a significant improvement over competing methods (all $p$-values $<10^{-3}$).

Fractional Gaussian Noise. This covariance structure for the error terms was also considered by [31]. We construct a dense coefficient matrix by setting $s=s_{g}=0$. The results are presented in Figure 4, showing that our proposed method provides a considerable improvement over competitors (all $p$-values $<10^{-19}$). The margin by which MrRCE outperforms the other methods increases with $\rho$.

Equicorrelation. Finally, we let $\tilde{\Sigma}_{ij}=\rho_{E}=0.9$ for $i\neq j$, and set $s=s_{g}=0.1$. The results are displayed in Figure 5. The MRCE method exploits the correlated errors, achieving better performance than the Group Lasso, Ridge and OLS methods, and is second only to MrRCE, which significantly outperforms all competing methods for all values of $\rho$ (all $p$-values $<10^{-8}$).

5 Applications

We consider two publicly available real-life datasets:

1. NYC Taxi Rides (data available at http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml). The data consists of the daily number of New York City (NYC) taxi rides, ranging from January 2016 to December 2017.

2. Avocado Prices (data available at https://www.kaggle.com/neuromusic/avocado-prices). The data was provided by the Hass Avocado Board website and represents weekly retail scan data for national retail volume (units) and price.

We measure and report the performance of the following methods:

1. Ordinary Least Squares.

2. Group Lasso. Apply $3$-fold CV for the selection of the tuning parameter.

3. Separate Lasso. Perform $q$ separate lasso regression models with $3$-fold CV for selecting the tuning parameters.

4. Ridge Regression. Perform $q$ separate ridge regression models with a shared regularization parameter, selected via LOO-CV (i.e., the same ridge penalty for all $pq$ parameters).

5. Separate Ridge Regression. Perform $q$ separate ridge regression models with LOO-CV for selecting the tuning parameters.

6. MRCE. Apply $5$-fold CV for selecting the regularization parameters.

7. MrRCE. Apply $3$-fold CV for selecting the graphical lasso regularization parameter.

NYC Taxi Rides. We consider the problem of forecasting the performance of $q=2$ taxi vendors in NYC, using historical records of the daily number of rides, spanning from January 2016 to December 2017 ($n=730$). This multivariate time-series data is generated by human activity, and as such can be expected to be strongly affected by multiple seasonalities and holiday effects. For a regular period $P$, we use a Fourier series to model the periodic effects [2, 34], by constructing $2\cdot N_{P}$ features of the form,

[TABLE]

We account for the weekly and yearly seasonalities and introduce the corresponding $P$-cyclic covariates. For a holiday $H$, which occurs at times $T\left(H\right)$, we use simple indicator predictors of the form,

[TABLE]

Lastly, we incorporate covariates for the modeling of a piecewise linear trend. These transformations shift the multivariate time-series problem into a feature space with $p=68$ , where the linear assumption is appropriate. We denote the transformed observations by,

[TABLE]

where $Z\left(t\right)\in\mathbb{R}^{p}$ contains measurements of the covariates, $Y\left(t\right)\in\mathbb{R}^{q}$ contains the $q$ responses, and $Y_{j}\left(t\right)\in\left[0,1\right]$ represents the scaled response of the $j$th task at time $t$, obtained by dividing the original observation by the maximal response value for that given task.
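The seasonal and holiday covariates described above can be sketched as follows. This is an illustration only: it assumes the usual sine/cosine pair form for the $2\cdot N_{P}$ Fourier features, the function names are hypothetical, and the numbers of harmonics (3 and 10) and the holiday days used in the example are placeholders rather than values from the paper. The piecewise linear trend covariates are not included.

```python
import numpy as np

def fourier_features(t, period, n_terms):
    """Return a (len(t), 2 * n_terms) matrix of cosine/sine seasonal covariates."""
    t = np.asarray(t, dtype=float)
    cols = []
    for l in range(1, n_terms + 1):
        angle = 2.0 * np.pi * l * t / period
        cols.append(np.cos(angle))
        cols.append(np.sin(angle))
    return np.column_stack(cols)

def holiday_indicator(t, holiday_times):
    """Return 1.0 where t belongs to the holiday's occurrence set T(H), else 0.0."""
    return np.isin(np.asarray(t), list(holiday_times)).astype(float)

# Example design matrix over n = 730 days (placeholder settings).
t = np.arange(730)
Z = np.column_stack([
    fourier_features(t, period=7.0, n_terms=3),      # weekly seasonality
    fourier_features(t, period=365.25, n_terms=10),  # yearly seasonality
    holiday_indicator(t, {0, 365}),                  # one hypothetical holiday
])
```

Each seasonal block contributes $2\cdot N_{P}$ columns, so the example above yields $6+20+1=27$ covariates.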

We evaluate the forecast performance of the different methods using a cross-validation-like approach, in which we produce $K$ forecasts at multiple cutoff points along the history [34]. For cutoff $k=0,\ldots,K-1$, we use the first $n_{train,k}=365+k\cdot 14$ days for training, and the next $n_{test}=14$ observations as the test set. The performance of method $m$ over the $k$th “fold” is measured according to the Mean Squared Error (MSE),

[TABLE]

where $T_{k}$ are the time indices for the $k$th test set, and $\hat{y}_{j,t}\left(m\right)$ is the forecast for the $j$th task at time $t$, produced using method $m$. Using the above procedure, we obtain $K=26$ realizations of the MSE, $\left\{MSE_{k}^{m}\right\}_{k=0}^{K-1}$, for each method $m$. The mean and standard deviation of the MSE for each of the methods are reported in Table 1. The MrRCE method attains the best forecast performance, with the lowest mean MSE and the smallest standard deviation, followed by the Ridge and Separate-Ridge methods. A paired $t$-test confirms that the improvement in accuracy achieved by our method is significant (all $p$-values $<0.05$). We also note that the estimated similarity level for this data is $\hat{\rho}=0.992$.
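The cutoff scheme can be sketched in a few lines. The sketch below generates the expanding training windows and 14-day test windows, and with $n=730$ it reproduces the $K=26$ cutoffs. The function names are hypothetical, and `fold_mse` assumes that the MSE is averaged over all test times and tasks, which is an assumption about the normalization.

```python
def rolling_cutoffs(n, n_initial=365, step=14, horizon=14):
    """Yield (train_indices, test_indices) for each forecast cutoff k = 0, 1, ..."""
    k = 0
    while n_initial + k * step + horizon <= n:
        n_train = n_initial + k * step
        yield range(n_train), range(n_train, n_train + horizon)
        k += 1

def fold_mse(y_true, y_pred):
    """Squared error averaged over all test times and tasks (assumed normalization)."""
    sq = [(a - b) ** 2
          for row_true, row_pred in zip(y_true, y_pred)
          for a, b in zip(row_true, row_pred)]
    return sum(sq) / len(sq)

splits = list(rolling_cutoffs(730))
```

The last cutoff trains on the first $365+25\cdot 14=715$ days and tests on the following 14, so each test window lies strictly after its training window.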

Avocado Prices. We consider the weekly average avocado prices for $q=5$ regions in the US, spanning from January 2015 to April 2018 ($n=169$). We use national volume metrics and a one-hot encoding for years ($p=12$) to predict the average avocado prices for each region. The performance is measured according to the MSE, with 10-fold CV. The mean and standard deviation of the MSE, calculated over all folds, are reported in Table 2. Our proposed method attains the best prediction performance, with the lowest mean MSE and the smallest standard deviation. A paired $t$-test confirms that the improvement in accuracy is significant (all $p$-values $<0.05$). We also report the estimated similarity level for this data, at $\hat{\rho}=0.689$.

6 Summary and Discussion

We have presented the MrRCE method to produce an estimator of the covariance components and a predictor of the multivariate regression coefficient matrix. Our method exploits similarities among random coefficients and accounts for correlated errors. We have proposed an efficient EM-based algorithm for computing MrRCE. Using simulated and real data, we have illustrated that the proposed method can outperform the commonly used methods for multivariate regression in settings where errors or coefficients are related.

Our method can be extended in several ways. For example, one could consider an arbitrary group structure over the coefficient matrix, model the similarities via a different covariance structure, or allow for a per-group similarity coefficient. In addition, one could extend the MrRCE formulation to also allow for fixed effects, as in (3).

7 Acknowledgement

This research was partially supported by Israeli Science Foundation grant 1804/16.

Bibliography

[1] Theodore Wilbur Anderson “Estimating linear restrictions on regression coefficients for multivariate normal distributions” In The Annals of Mathematical Statistics JSTOR, 1951, pp. 327–351
[2] Andrew C Harvey and Neil Shephard “Structural time series models” In Econometrics 11, Handbook of Statistics Elsevier, 1993, pp. 261–302
[3] Leo Breiman and Jerome H Friedman “Predicting multivariate responses in multiple linear regression” In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59.1 Wiley Online Library, 1997, pp. 3–54
[4] Philip J Brown and James V Zidek “Adaptive multivariate ridge regression” In The Annals of Statistics JSTOR, 1980, pp. 64–74
[5] Florentina Bunea, Yiyuan She and Marten H Wegkamp “Optimal selection of reduced rank estimators of high-dimensional matrices” In The Annals of Statistics JSTOR, 2011, pp. 1282–1309
[6] Chris Chatfield, Jim Zidek and Jim Lindsey “An introduction to generalized linear models” Chapman Hall/CRC, 2010
[7] Lisha Chen and Jianhua Z Huang “Sparse reduced-rank regression with covariance estimation” In Statistics and Computing 26.1-2 Springer, 2016, pp. 461–470
[8] Alexandre d'Aspremont, Onureena Banerjee and Laurent El Ghaoui “Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data” In Journal of Machine Learning Research 9.Mar, 2008, pp. 485–516
