Compressed and Penalized Linear Regression

Darren Homrighausen; Daniel J. McDonald

arXiv:1705.08036·stat.ME·June 11, 2021

Compressed and Penalized Linear Regression

Darren Homrighausen, Daniel J. McDonald

PDF

TL;DR

This paper introduces new methods for linear regression that balance computational efficiency and statistical accuracy, leveraging approximation and regularization techniques to improve estimation and prediction in large datasets.

Contribution

It proposes novel regularized linear regression procedures inspired by approximation theory, demonstrating their benefits for both computation and statistical performance.

Findings

01

Methods improve computational speed on large datasets.

02

Regularization enhances statistical accuracy and interpretability.

03

Approximations can outperform exact solutions in statistical tasks.

Abstract

Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are assumed to degrade statistical performance relative to exact methods. In applied mathematics, where much of the current theoretical work on approximation resides, the inputs are considered to be observed exactly. The prevailing philosophy is that while the exact problem is, regrettably, unsolvable, any approximation should be as small as possible. However, from a statistical perspective, an approximate or regularized solution may be preferable to the exact one. Regularization formalizes a trade-off between fidelity to the data and adherence to prior knowledge about the data-generating process such as smoothness or sparsity. The resulting estimator tends…

Tables1

Table 1. Table 1: Number of observations for each of the 8 genetics data sets.

dataset	B1	B2	B3	G1	G2	W1	W2	W3
$n$	157614	125056	103394	51751	64966	146828	171776	143570

Equations324

Y = X β_{*} + σ ε,

Y = X β_{*} + σ ε,

ℓ (β) = β min ∣ ∣ X β - Y ∣ ∣_{2}^{2} .

ℓ (β) = β min ∣ ∣ X β - Y ∣ ∣_{2}^{2} .

β_{F C} = β argmin ∣ ∣ Q X β - Q Y ∣ ∣_{2}^{2} .

β_{F C} = β argmin ∣ ∣ Q X β - Q Y ∣ ∣_{2}^{2} .

ℓ (β_{F C}) \leq (1 + ϵ) ℓ (β) .

ℓ (β_{F C}) \leq (1 + ϵ) ℓ (β) .

β min ∣ ∣ Q (X β - Y) ∣ ∣_{2}^{2} = β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Q^{⊤} Q Y) .

β min ∣ ∣ Q (X β - Y) ∣ ∣_{2}^{2} = β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Q^{⊤} Q Y) .

β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Y),

β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Y),

β (λ) := β argmin ∣ ∣ X β - Y ∣ ∣_{2}^{2} + λ ∣ ∣ β ∣ ∣_{2}^{2} .

β (λ) := β argmin ∣ ∣ X β - Y ∣ ∣_{2}^{2} + λ ∣ ∣ β ∣ ∣_{2}^{2} .

β min ∣ ∣ Q (X β - Y) ∣ ∣_{2}^{2} + λ ∣ ∣ β ∣ ∣_{2}^{2} = β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Q^{⊤} Q Y) + λ β^{⊤} β .

β min ∣ ∣ Q (X β - Y) ∣ ∣_{2}^{2} + λ ∣ ∣ β ∣ ∣_{2}^{2} = β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Q^{⊤} Q Y) + λ β^{⊤} β .

β_{F C} (λ) = (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q Y .

β_{F C} (λ) = (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q Y .

β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Y) + λ β^{⊤} β

β min (β^{⊤} X^{⊤} Q^{⊤} Q X β - 2 β^{⊤} X^{⊤} Y) + λ β^{⊤} β

β_{P C} (λ) = (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Y .

β_{P C} (λ) = (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Y .

β_{α} (λ) = α β_{F C} (λ) + (1 - α) β_{P C} (λ),

β_{α} (λ) = α β_{F C} (λ) + (1 - α) β_{P C} (λ),

B (λ) = [β_{F C} (λ), β_{P C} (λ)] \in R^{p \times 2},

B (λ) = [β_{F C} (λ), β_{P C} (λ)] \in R^{p \times 2},

α = α \in [0, 1] argmin X B (λ) [α (1 - α)] - Y_{2}^{2} .

α = α \in [0, 1] argmin X B (λ) [α (1 - α)] - Y_{2}^{2} .

α = α \in R^{2} argmin ∣ ∣ X B (λ) α - Y ∣ ∣_{2}^{2} .

α = α \in R^{2} argmin ∣ ∣ X B (λ) α - Y ∣ ∣_{2}^{2} .

β_{α} (λ) := B (λ) α

β_{α} (λ) := B (λ) α

Y_{α} (λ) := X β_{α} (λ),

Y_{α} (λ) := X β_{α} (λ),

df (g) := \frac{1}{σ ^{2}} i = 1 \sum n Cov (g_{i} (Y), Y_{i}),

df (g) := \frac{1}{σ ^{2}} i = 1 \sum n Cov (g_{i} (Y), Y_{i}),

Risk (g) = E ∣ ∣ g (Y) - μ ∣ ∣_{2}^{2} = E ∣ ∣ g (Y) - Y ∣ ∣_{2}^{2} - n σ^{2} + 2 σ^{2} df (g) .

Risk (g) = E ∣ ∣ g (Y) - μ ∣ ∣_{2}^{2} = E ∣ ∣ g (Y) - Y ∣ ∣_{2}^{2} - n σ^{2} + 2 σ^{2} df (g) .

Risk (g) = ∣ ∣ g (Y) - Y ∣ ∣_{2}^{2} - n σ^{2} + 2 σ^{2} df (g),

Risk (g) = ∣ ∣ g (Y) - Y ∣ ∣_{2}^{2} - n σ^{2} + 2 σ^{2} df (g),

GCV (g) = \frac{∣ ∣ g ( Y ) - Y ∣ ∣ _{2}^{2}}{( 1 - df ( g ) / n ) ^{2}} .

GCV (g) = \frac{∣ ∣ g ( Y ) - Y ∣ ∣ _{2}^{2}}{( 1 - df ( g ) / n ) ^{2}} .

X β_{α} (λ) = X B (λ) α = α_{1} X β_{P C} (λ) + α_{2} X β_{F C} (λ) = Φ_{1} Y + Φ_{2} Y = Φ Y

X β_{α} (λ) = X B (λ) α = α_{1} X β_{P C} (λ) + α_{2} X β_{F C} (λ) = Φ_{1} Y + Φ_{2} Y = Φ Y

df = α_{1} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q) + α_{2} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤}) .

df = α_{1} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q) + α_{2} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤}) .

df = α_{1} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q) + α_{2} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤}) .

df = α_{1} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤} Q^{⊤} Q) + α_{2} tr (X (X^{⊤} Q^{⊤} Q X + λ I)^{- 1} X^{⊤}) .

λ_{t es t} = λ argmin \frac{1}{n} Y_{t es t} - X_{t es t} β (λ)_{2}^{2} .

λ_{t es t} = λ argmin \frac{1}{n} Y_{t es t} - X_{t es t} β (λ)_{2}^{2} .

\frac{test _{GC V}}{test _{m i n}} = \frac{Y _{t es t} - X _{t es t} β ( λ _{GC V} ) _{2}^{2}}{Y _{t es t} - X _{t es t} β ( λ _{t es t} ) _{2}^{2}} \geq 1.

\frac{test _{GC V}}{test _{m i n}} = \frac{Y _{t es t} - X _{t es t} β ( λ _{GC V} ) _{2}^{2}}{Y _{t es t} - X _{t es t} β ( λ _{t es t} ) _{2}^{2}} \geq 1.

E [β_{*} - β_{r i d g e} (λ) \mbox ∣ \mbox X]

E [β_{*} - β_{r i d g e} (λ) \mbox ∣ \mbox X]

= λV (D^{2} + λ I_{p})^{- 1} V^{⊤} β_{*} .

V [β_{r i d g e} (λ) \mbox ∣ \mbox X]

= σ^{2} V (D^{2} + λ I_{p})^{- 1} D^{2} (D^{2} + λ I_{p})^{- 1} V^{⊤} .

bias^{2} (β_{r i d g e} (λ) \mbox ∣ \mbox X)

bias^{2} (β_{r i d g e} (λ) \mbox ∣ \mbox X)

tr (V [β_{r i d g e} (λ) \mbox ∣ \mbox X])

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Compressed and Penalized Linear Regression

Darren Homrighausen

Department of Statistics, Colorado State University

and

Daniel J. McDonald

Department of Statistics, Indiana University

Darren Homrighausen is currently Visiting Assistant Professor at Southern Methodist University. The authors gratefully acknowledge support from the National Science Foundation (grant DMS-1407543 to DH; grant DMS-1407439 to DJM) and the Institute for New Economic Thinking (grant INO14–00020 to DH and DJM).

Abstract

Modern applications require methods that are computationally feasible on large datasets but also preserve statistical efficiency. Frequently, these two concerns are seen as contradictory: approximation methods that enable computation are assumed to degrade statistical performance relative to exact methods. In applied mathematics, where much of the current theoretical work on approximation resides, the inputs are considered to be observed exactly. The prevailing philosophy is that while the exact problem is, regrettably, unsolvable, any approximation should be as small as possible. However, from a statistical perspective, an approximate or regularized solution may be preferable to the exact one. Regularization formalizes a trade-off between fidelity to the data and adherence to prior knowledge about the data-generating process such as smoothness or sparsity. The resulting estimator tends to be more useful, interpretable, and suitable as an input to other methods.

In this paper, we propose new methodology for estimation and prediction under a linear model borrowing insights from the approximation literature. We explore these procedures from a statistical perspective and find that in many cases they improve both computational and statistical performance.

Keywords: preconditioning; sketching; regularization; ordinary least squares; ridge regression

1 Introduction

Recent work in large-scale data analysis has focused on developing fast and randomized approximations to important numerical linear algebra tasks such as solving least squares problems (Woodruff, 2014; Rokhlin and Tygert, 2008; Drineas et al., 2011) and finding spectral decompositions (Halko et al., 2011; Gittens and Mahoney, 2013; Homrighausen and McDonald, 2016). These approaches, known as compression, sketching, or preconditioning, take a given data set and construct a reduced sized, “compressed,” solution. Theoretical justifications for these approximate approaches upper bounds the difference in the objective function evaluated at the “compressed” solution relative to the full-data solution.

In numerical linear algebra, a major goal is to develop compression algorithms that decrease the computational or storage burden while not giving up too much accuracy. Due to their random nature, theoretical performance bounds are stated with high probability with respect to the subsampling or random projection mechanism (Halko et al., 2011). For example, when confronted with a massive least squares problem, we could attempt to form a compressed solution such that its residual sum of squares is not too much larger than the residual sum of squares of the least squares solution.

In this paper, we take a more statistical perspective: we investigate the performance in terms of parameter estimation and prediction risk. Leveraging insights into the statistical behavior of the most commonly used compressions, we develop and explore novel algorithms for compressed least squares. We show that our methods provide satisfactory, or even superior, statistical performance at a fraction of the computation and storage costs.

1.1 Overview of the problem

Suppose we make $n$ paired, independent observations $X_{i}\in\mathbb{R}^{p}$ and $Y_{i}\in\mathbb{R}$ , $i=1,\ldots,n$ , where $X_{i}$ is a vector of measurements, $Y_{i}$ is the associated response, $n\gg p$ , and both $n$ and $p$ are both very large. Concatenating the vectors $X_{i}$ row-wise into a matrix $\mathbb{X}\in\mathbb{R}^{n\times p}$ and the responses $Y_{i}$ into a vector $Y$ , we assume that there exists a $\beta_{*}\in\mathbb{R}^{p}$ such that

[TABLE]

with $\mathbb{E}\varepsilon=0$ , $\mathbb{V}\varepsilon=I_{n}$ , and $\mathbb{X}$ fixed.

We can seek to estimate a relationship between $X$ and $Y$ via linear regression. That is, for any vector $\beta\in\mathbb{R}^{p}$ , we define $\ell(\beta):=\sum_{i=1}^{n}(Y_{i}-X_{i}^{\top}\beta)^{2}$ . Writing the (squared) Euclidean norm as $\left|\left|\mathbb{X}\beta-Y\right|\right|_{2}^{2}:=\sum_{i=1}^{n}(Y_{i}-X_{i}^{\top}\beta)^{2}$ , a least squares solution is a vector $\widehat{\beta}\in\mathbb{R}^{p}$ such that

[TABLE]

This is given by $\widehat{\beta}=\mathbb{X}^{\dagger}Y$ , where $\mathbb{X}^{\dagger}$ is the Moore-Penrose pseudo inverse of $\mathbb{X}$ . If $\mathbb{X}$ has full column rank, the solution simplifies to $\widehat{\beta}=(\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}Y$ .

The least squares solution can be computed stably in $O(np^{2})$ time using, for example, routines in LAPACK such as the QR decomposition, Cholesky decomposition of the normal equations, or the singular value decomposition (Golub and Van Loan, 2012). It also has a few, well-known statistical properties such as being a minimum variance unbiased estimator or the best linear unbiased estimator. However, classic numerical linear algebraic techniques require substantial random access to $\mathbb{X}$ and $Y$ , so computing $\widehat{\beta}$ can be infeasible or undesirable in practice.

The big-data regime we consider, i.e. $n\gg p$ and both are very large, can happen in many different scientific areas such as psychology, where cellular phones are used to collect high-frequency data on individual actions; atmospheric science, where multiresolution satellite images are used to understand climate change and predict future weather patterns; technology companies, which use massive customer databases to predict tastes and preferences; or astronomy, where hundreds of millions of objects are measured using radio telescopes.

1.2 Prior work

A very popular approach in the approximation literature (Rokhlin and Tygert, 2008; Drineas et al., 2011; Woodruff, 2014) is to generalize equation (2) to include a compression matrix $Q\in\mathbb{R}^{q\times n}$ , with $n\geq q>p$ , $\ell_{Q}(\beta)=\left|\left|Q\mathbb{X}\beta-QY\right|\right|_{2}^{2}$ . The associated approximation to $\widehat{\beta}$ is the fully compressed estimator

[TABLE]

Now, $\widehat{\beta}_{FC}$ can be computed via standard techniques by using the compressed data $Q\mathbb{X}$ and $QY$ . We defer discussion of strategies and trade-offs for specific choices of $Q$ to Section 2.4, but, clearly, some structure on $Q$ is required to enable fast multiplication.

The standard theoretical justification defines a tolerance parameter $\epsilon$ and a compression parameter $q=q(\epsilon)$ such that with high probability (Drineas et al., 2011, 2012),

[TABLE]

In this case, the probability is stated with respect to the process that generates $Q$ only and the data are considered fixed. Thus (4) is a worst-case analysis since it must hold uniformly over all data sets, regardless of the “true” data generating process.

There have been many proposals for how to choose the compression matrix $Q$ and the compression parameter $q$ . A classical approach is to define $Q$ such that $Q\mathbb{X}$ corresponds to uniform random sampling of the rows of $\mathbb{X}$ . The success of this approach, in the sense of obtaining small $\epsilon$ and $q$ in (4), depends crucially on the coherence of $\mathbb{X}$ , defined as the maximum squared-Euclidean norm of a row of an orthogonal matrix that spans the column space of $\mathbb{X}$ (Avron et al., 2010; Drineas et al., 2011; Rokhlin and Tygert, 2008). Note that the coherence is very different from the condition number, which is the ratio of the largest and smallest singular values of $\mathbb{X}$ : the coherence is necessarily in the interval $[p/n,1]$ while the condition number can be any positive real number. If $\mathbb{X}$ has coherence $p/n$ , then $q=O(p\log p)$ is sufficient to obtain a high probability bound while a coherence of $1$ essentially requires all rows to be sampled (Avron et al., 2010).

Therefore, it is important to either reduce the coherence or sample the rows in proportion to their influence on it (this influence is known as a “leverage score” Drineas et al., 2006). Unfortunately, it is as expensive to compute the coherence or the leverage scores as it is to solve the original least squares problem in (2). Hence, if computational complexity is of concern, they are both unavailable.

Instead, a general approach for reducing the coherence of $\mathbb{X}$ is to randomly “mix” its rows. The rows of this new, mixed matrix have smaller coherence (with high probability) and can then be randomly sampled. In particular, Drineas et al. (2011) show that equation (4) holds as long as $q$ grows like $p\log(n)\log(p\log(n))$ ; though it is claimed that smaller $q$ still works well in practice. Readers interested in the practical performance of these randomized algorithms should see Blendenpik (Avron et al., 2010) or LSRN (Meng et al., 2014).

Relative to the computational properties of these compression methods, there has been comparatively little work on their statistical properties. Raskutti and Mahoney (2015) analyze various relative efficiency measures of $\widehat{\beta}_{FC}$ versus $\widehat{\beta}$ as a function of the compression matrix $Q$ . They find that the statistical quality of $\widehat{\beta}_{FC}$ depends on the oblique projection matrix $U(QU)^{\dagger}U$ , where $U$ is the left singular matrix of $\mathbb{X}$ . Additionally, Ma et al. (2015) develop a theoretical framework, which we adopt, to compare the statistical performance of various compression matrices used in (3). In particular, they show that in terms of the mean squared error (MSE) of $\widehat{\beta}_{FC}$ , neither leverage-based sampling nor uniform sampling uniformly dominates the other. That is, leverage-based sampling works better for some datasets while uniform sampling has better MSE for others. Despite using their framework, we take a different perspective: that by combining compression with regularization, we can improve over the uncompressed solution and produce an estimator with both better computational and statistical properties.

1.3 Our contribution

This paper, in contrast with previous research, adopts the perspective that approximations mimicking the least squares estimator $\widehat{\beta}$ may produce faster methods, but may not result in good estimators. Instead, we seek approximations that minimize estimation and/or prediction error. It is well known that the least squares estimator $\widehat{\beta}$ , while being a minimum variance, unbiased estimator, performs poorly in terms of prediction or estimation risk relative to regularized estimators. As approximation and regularization are very similar, this insight suggests the intriguing possibility that it is possible to develop a compressed estimator that performs better statistically, and is cheaper to compute and store, than $\widehat{\beta}$ .

We show that $\widehat{\beta}_{FC}$ , like $\widehat{\beta}$ , is unbiased and hence must have a larger estimation and prediction error. As a remedy, we define a partial compression estimator, notated $\widehat{\beta}_{PC}$ and defined in equation (6). In contrast to $\widehat{\beta}_{FC}$ , $\widehat{\beta}_{PC}$ is a biased estimator. In fact, its bias is such that it performs relatively poorly in practice. Therefore, we propose a linear combination of $\widehat{\beta}_{FC}$ and $\widehat{\beta}_{PC}$ . We find that this combined estimator performs much better than either individually.

Furthermore, as regularized regression outperforms $\widehat{\beta}$ , it is sensible to extend $\widehat{\beta}_{FC}$ , $\widehat{\beta}_{PC}$ , and our linear combination to analogous regularized versions. This introduces a tuning parameter with which to directly calibrate bias and variance. In this paper, we use ridge regression as a regularized least squares method, though other methods such as bridge, lasso, or the nonnegative garrote are alternatives. The ridge regression estimator works by inflating the smallest singular values of $\mathbb{X}$ thereby stabilizing the least squares problem. However, solving for the ridge regression estimator has the same computational complexity as solving equation (1). Therefore, we apply the compression paradigm to ridge regression. We find that in many cases, these estimators can be computed for a fraction of the cost while providing better performance than the original least-squares estimator.

In Section 2 we carefully define the our proposed estimators. Because these estimators contain a tuning parameter, Section 3 gives a data-driven procedure for selecting it with minimal extra computation. Section 4 examines the performance of these compressed estimators via simulation. In particular, we demonstrate that in most cases our methods have better performance than the least-squares solution and occasionally better performance than the uncompressed ridge regression estimator. Likewise, Section 5 looks at the effectiveness of our estimators on two real-data examples. In Section 6, we give theoretical expressions for the bias and variance of our estimators and compare these to the standard results for ridge regression. We show that, to a first order approximation, the compressed estimators require a different tuning parameter than ridge regression to minimize bias and variance. Lastly, Section 7 summarizes our conclusions and presents avenues for further research.

2 Compressed regression

In this section, we describe the standard set up of compressed least squares regression before introducing our modifications and the specific form of the compression matrix $Q$ we consider.

2.1 Compressed least squares regression

The fully compressed least squares estimator, defined in (3), can be written as

[TABLE]

An alternative approach is partial compression (Becker et al., 2017)

[TABLE]

which removes the compression matrix from the cross-product term. Depending on the particular draw and form of the compression matrix $Q$ , there may not be unique solutions to equations (5) or (6).

2.2 Compressed ridge regression

A well used technique to stabilize the least squares problem is known as Tikhonov regularization or ridge regression in applied mathematics (Tikhonov and Arsenin, 1979) and statistics (Hoerl and Kennard, 1970), respectively. The ridge regression problem can be written in the Lagrangian form as

[TABLE]

While $\widehat{\beta}(\lambda)$ has better numerical properties than $\widehat{\beta}(0)\equiv\widehat{\beta}$ because it inflates the small singular values of $\mathbb{X}$ , it improves neither the computational complexity, which is still $O(np^{2})$ , nor the storage, which is $O(np)$ .

In addition to better numerical stability, the ridge solution has lower MSE than $\widehat{\beta}$ for some $\lambda$ . These results beg the question: if $\widehat{\beta}(\lambda)$ is a better overall procedure, why not compress it instead? Leveraging this insight, we define the fully compressed ridge estimator, in analogue to equation (5), as

[TABLE]

The minimizer to equation (8) can be written

[TABLE]

Likewise, analogous to equation (6), the partially compressed ridge estimator solves

[TABLE]

with minimizer

[TABLE]

Ignoring numerical issues for very small $\lambda$ , both of these estimators always have a unique solution regardless of $Q$ and $\mathbb{X}$ .

2.3 Linear combination compressed ridge regression

Instead of choosing between $\widehat{\beta}_{FC}(\lambda)$ and $\widehat{\beta}_{PC}(\lambda)$ , it is reasonable to use a model averaged estimator formed by combining them. Consider the estimator generated by a convex combination

[TABLE]

where $\alpha\in[0,1]$ . A data-driven value $\widehat{\alpha}$ can be computed by forming the matrix

[TABLE]

a column-wise concatenation of $\widehat{\beta}_{FC}(\lambda)$ and $\widehat{\beta}_{PC}(\lambda)$ , and then solving

[TABLE]

There is no reason that the convex constraint will provide the best estimator. Hence, we also consider the unconstrained two-dimensional least squares problem given by

[TABLE]

We emphasize that either version can be computed in $O(n)$ time. Lastly, an estimator of $\beta$ , or a prediction $\widehat{Y}$ , can be produced with

[TABLE]

and

[TABLE]

respectively.

2.4 Compression matrices

The effectiveness of these approaches depends on $q$ , the nature of $Q$ , and the structure of $\mathbb{X}$ and $\beta_{*}$ . For arbitrary $Q$ , the multiplication $Q\mathbb{X}$ would take $O(qnp)$ operations and, hence, could be expensive relative to solving the original least squares problem. However, this multiplication is “embarrassingly parallel” (say, by the map-reduce framework) rendering the multiplication cost somewhat meaningless in contrast to the least squares solution, which is not easily parallelized. Therefore, the limiting computation for solving (8) is only $O(qp^{2})$ .

The structure of $Q$ is chosen, typically, either for its theoretical or computational properties. Examples are standard Gaussian entries, producing dense but theoretically convenient $Q$ , fast Johnson-Lindenstrauss methods, or the counting sketch. A thorough discussion of these methods is outside the scope of this paper. Instead, we use a “sparse Bernoulli” matrix (Kane and Nelson, 2014; Dasgupta et al., 2010; Woodruff, 2014; Achlioptas, 2003). Here, the entries of $Q$ are generated independently where $\mathbb{P}(Q_{ij}=0)=1-1/s$ and $\mathbb{P}(Q_{ij}=-1)=\mathbb{P}(Q_{ij}=1)=1/(2s)$ for some $s\geq 1$ . Then, $\mathbb{E}Q$ has $qn/s$ non-zero entries and can be multiplied quickly with high probability, while equation (5) can be solved without parallelization in $O(qnp/s+qp^{2})$ time on average. Throughout this paper we assume $Q$ is renormalized so that $\mathbb{E}Q^{\top}Q=I_{n}$ .

3 Tuning Parameter Selection

Our methods require appropriate selection of $\lambda$ to achieve good performance. Presumably, if computations or storage are at a premium, computer-intensive resampling methods such as cross-validation are unavailable. Therefore, we develop methods that rely on a corrected training-error estimate of the risk. These corrections depend crucially on the degrees of freedom. Specifically, the degrees of freedom (Efron, 1986) of a procedure $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ that produces predictions $g(Y)=\widehat{Y}$ is

[TABLE]

where $\sigma^{2}=\mathbb{V}(Y_{i})$ .

If the response vector is distributed according to the homoskedastic model $Y\sim(\mu,\sigma^{2}I_{n})$ , then we can decompose the prediction risk of the procedure $g$ as

[TABLE]

A plug-in estimate of $\textrm{Risk}(g)$ (analogous to $C_{p}$ , Mallows, 1973) is then

[TABLE]

where $\widehat{\textrm{df}}(g)$ and $\widehat{\sigma}^{2}$ are estimates of $\textrm{df}(g)$ and $\sigma^{2}$ , respectively. We discuss strategies for forming $\widehat{\textrm{df}}(g)$ in Section 3.1. As for the variance, ordinarily one would use the unbiased estimator $\widehat{\sigma}^{2}=(n-\textrm{df}(g))^{-1}\left|\left|(I_{n}-\Pi_{\mathbb{X}})Y\right|\right|_{2}^{2}$ , where $\Pi_{\mathbb{X}}=\mathbb{X}(\mathbb{X}^{\top}\mathbb{X})^{\dagger}\mathbb{X}^{\top}=UU^{\top}$ is the orthogonal projection onto the column space of $\mathbb{X}$ . However, computing $I_{n}-\Pi_{\mathbb{X}}$ is just as expensive as computing the least squares solution $\Pi_{\mathbb{X}}Y$ itself. Therefore, to avoid estimating $\sigma^{2}$ , we use generalized cross validation (Golub et al., 1979)

[TABLE]

Crucially, GCV does not require a variance estimator, a major advantage.

3.1 Estimating the degrees of freedom

For any procedure $g$ which is linear in $Y$ , that is, there exists some matrix $\Phi$ which does not depend on $Y$ such that $g(Y)=\Phi Y$ , then $\textrm{df}(g)=\operatorname{tr}(\Phi)$ , the trace of $\Phi$ . Therefore, computing the exact degrees of freedom for $\widehat{\beta}_{\alpha}(\lambda)$ (that is, the linear combination estimator with a fixed $\alpha$ ) is straightforward. In this case,

[TABLE]

where $\Phi_{1}=\alpha_{1}\mathbb{X}(\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}+\lambda I)^{-1}\mathbb{X}^{\top}$ and $\Phi_{2}=\alpha_{2}\mathbb{X}(\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}+\lambda I)^{-1}\mathbb{X}^{\top}Q^{\top}Q$ . So the degrees of freedom of $\widehat{\beta}_{\alpha}(\lambda)$ is

[TABLE]

In particular, both the fully and partially compressed estimators have simple forms for the degrees of freedom which do not need to be estimated.

For the linear combination estimator when $\alpha$ is estimated, as in equations (15) or (16), computing the degrees of freedom is more complicated. This estimator is nonlinear because both the matrix $B(\lambda)$ and the weight vector $\widehat{\alpha}$ are functions of $Y$ . A straighforward estimator of the degrees of freedom for $\widehat{\beta}_{\widehat{\alpha}}(\lambda)$ is created by plugging $\widehat{\alpha}$ into equation (19):

[TABLE]

Though this approximation intuitively underestimates the degrees of freedom, the general idea is used for other nonlinear estimators such as neural networks (Ingrassia and Morlini, 2007).

Alternatively, the degrees of freedom can be computed via Stein’s lemma (Stein, 1981) if we are willing to assume that the response vector is multivariate normal: $Y\sim N(\mu,\Sigma)$ . Then, if $g(Y)$ is continuous and almost differentiable in $Y$ , $\textrm{df}(g)=\mathbb{E}[(\nabla\cdot g)(Y)]$ , where $(\nabla\cdot g)(Y)=\sum_{i=1}^{n}\partial g_{i}/\partial Y_{i}$ is the divergence of $g$ . It immediately follows that $\widehat{\textrm{df}}(g)=(\nabla\cdot g)(Y)$ is an unbiased estimator of $\textrm{df}(g)$ . Though the calculus is tedious, the divergence of the linear combination estimator can be calculated by repeated applications of the chain rule. Unfortunately, we do not yet know how to compute the divergence without forming $Q^{\top}Q$ , which is a large, dense matrix, nor the implications of the required normality assumption. Hence, we consider this a possibly fruitful direction for future research.

3.2 Computing the path

In order to select tuning parameters, we need to compute the estimators quickly for a range of possible $\lambda$ . Luckily, this can be implemented in the same way as with ridge regression. That is we examine $(\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}+\lambda I)^{-1}=R(L^{2}+\lambda I)^{-1}R^{\top},$ where the singular value decomposition is written $Q\mathbb{X}=SLR^{\top}$ . Therefore, we can take the SVD of $Q\mathbb{X}$ once and then compute the entire path of solutions for a sequence of $\lambda$ while only increasing the computational complexity multiplicatively in the number of $\lambda$ values considered.

4 Simulations

In this section, we construct simulations to explore when $\widehat{\beta}_{\alpha}(\lambda)$ performs well.

4.1 Setup

To create data, we generate the design matrix $\mathbb{X}\in\mathbb{R}^{n\times p}$ by independently sampling the rows from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$ which has unit variance on the diagonal and covariance (correlation) $\rho\in\{0.2,\ 0.8\}$ off the diagonal. We then form $Y=\mathbb{X}\beta_{*}+\epsilon$ , where $\epsilon_{i}$ are i.i.d. Gaussian with mean zero and variance $\sigma^{2}$ . In all cases, we take $n=5000$ and let $p=50,\ 100,\ 250$ , or $500$ .

While the performance of our methods depends on all of these design conditions, we have found that the most important factor is the structure of $\beta_{*}$ . For this reason, we examine three related structures intended to illustrate the interaction between $\beta_{*}$ and compression.

In the first case, we take $\beta_{*}\sim\mbox{N}(0,\tau^{2}I_{p})$ . Under this model, ridge regression is Bayes-optimal for $\lambda_{*}=n^{-1}\sigma^{2}/\tau^{2}$ . We set $\tau^{2}=\pi/2$ so that $p^{-1}\mathbb{E}[\left|\left|\beta_{*}\right|\right|_{1}]=1$ . Finally, to ensure that $\lambda_{*}$ is not too small, we take $\sigma=50$ implying $\lambda_{*}\approx 0.32$ . The second structure for $\beta_{*}$ is created to make ridge regression perform poorly: we simply set $(\beta_{*})_{j}\equiv 1$ for all $1\leq j\leq p$ . This scenario is easier for our methods, and the simulations demonstrate that they outperform both ridge regression and ordinary least squares (OLS). Finally, we choose a middle ground: we take $(\beta_{*})_{j}=(-1)^{j-1}.$ As in the first case, the coefficients cluster around zero, but here there is no randomness. In the second and third scenarios, we also use $\sigma=50$ .

Remark 1.

The first scenario allows for an easy comparison between our methods and the optimal algorithm, ridge regression. However, determining appropriate values for $\tau^{2}$ and $\sigma^{2}$ requires some care. Under this model, if $\tau^{2}$ is large, then $\lambda_{*}$ will be very small so that OLS and ridge regression are nearly equivalent and our methods will perform relatively poorly. The reason (see Section 6) is that compression tends to increase baseline variance relative to ridge regression and therefore requires more regularization and a greater increase in bias. On the other hand, if $\tau^{2}$ is small, then $\lambda_{*}\rightarrow\infty$ and we will end up shrinking all coefficient estimates to zero all the time. Compression will exacerbate this effect and will look artificially good: we should just use the zero estimator.

To solve this Goldilocks problem and to get nontrivial results in these simulations, we need to set $\sigma$ and $\tau$ just right. Note that these considerations wouldn’t be an issue in the high-dimensional, $p>n$ , case since the least squares solution is unavailable. Our choice, with $\lambda_{*}\approx 0.32$ , is intended to be somewhat neutral toward our methods.

We examine four different compressed estimators with penalization: (1) full compression, (2) partial compression, (3) a linear combination of the first two, and (4) a convex combination of the first two. We also use the OLS estimator and the ridge regression estimator. For ridge regression, we use $\lambda_{*}$ in the first scenario (and label it “Bayes”) and choose $\lambda$ by minimizing GCV in the other cases. We generate 50 training data sets as above with $n=5000$ in all simulation. For the compressed estimators, we examine three possible values of $q\in\{500,\ 1000,\ 1500\}$ . In each case, we generate $Q\in\{-1,0,1\}^{q\times n}$ as a “sparse Bernoulli” matrix with $s=3$ .

4.2 Estimation error simulations

For all four compressed methods, there exist $\lambda$ values which allow the compressed method to beat ordinary least squares. Some occasionally beat ridge regression as well depending on the structure of $\beta_{*}$ . Regularized compression always outperforms unregularized compression for most $\lambda>0$ . While we have simulated all combinations of $p$ and $\rho$ , we only display results for $p=100$ and $\rho=0.2$ which are typical. Figures 1 to 3 show boxplots of the results for each estimation method across replications. Figure 1 shows the case where $\beta_{*}$ is drawn from a Gaussian distribution while Figure 2 shows $\beta_{*}\in\{-1,1\}^{p}$ and Figure 3 shows $(\beta_{*})_{j}\equiv 1,\ \forall j$ . Within each figure, the three panels display different choices of compression parameter $q$ . The $x$ -axis shows values $\lambda$ with the far-right section giving results for OLS and ridge regression. Finally, the $y$ -axis is the logarithm of the mean squared estimation error.

Combining partial and full compression strictly dominates the other compressed methods in terms of estimation risk. Furthermore, for every choice of $q$ and $\beta$ , there is a $\lambda$ such that the linear combination has better estimation risk than OLS. When $p$ is small relative to $n$ and the design has low correlation, only the linear combination and the convex combination outperform OLS. To compare with the case that the design has larger correlation, we also show $p=100$ and $\rho=0.8$ with $(\beta_{*})_{j}\equiv 1$ (Figure 4). In this case, all of the compressed methods outperform OLS for most choices of $\lambda$ .

This story holds for other values of $p$ and $\rho$ : (1) regularized compression beats unregularized compression; (2) when the correlation of the design matrix is small, there is some level of regularization such that compression beats OLS; (3) when the correlation of the design matrix is large, compression beats OLS at nearly all levels of regularization; (4) the regularized combination estimators (linear and convex) are almost always the best; (5) these methods approach the accuracy of the optimal, full-data Bayes estimator in many cases.

Remark 2.

In Figure 3, the performance of the linear combination estimator is nearly independent of the choice of $\lambda$ , and does significantly better than the other methods. This behavior is either a feature or a curse of the method depending on the user’s perspective. Because both full compression and partial compression (as well as ridge regression) shrink coefficients to zero, when $\lambda$ is large, the unconstrained linear combination will try to use the data to combine two vectors, both of which are nearly zero, in some optimal way. As long as $\beta_{*}$ is approximately constant, and $\widehat{\alpha}$ is unconstrained, this method can compensate by choosing $\widehat{\alpha}$ . As $\lambda\rightarrow\infty$ , the solution of (15) will adjust and so $\widehat{\beta}$ will actually converge to $(\beta_{*})_{j}\equiv 1$ rather than 0 (as the ridge estimator does). So if the data analyst believes that $\beta_{*}$ is approximately constant, but nonzero, this method will work well, even without prior information for the particular constant. Otherwise, it is safer to use the convex combination which avoids this pathology. The lack of constraint for $\widehat{\alpha}$ also has implications for our approximate degress-of-freedom estimator in (19) which we discuss in greater detail in the next section.

4.3 Selecting tuning parameters

The previous simulation shows that regularized compression can outperform full-data OLS and compare favorably with the performance of the Bayes estimator as long as we can choose $\lambda$ well. In this section, we focus on the case where $\beta_{*}$ has a Gaussian distribution to investigate the empirical tuning parameter strategies.

We again generate a training set with $n=5000$ , but we also generate an independent test set with $n=5000$ . Then we use the training set to choose $\lambda_{GCV}$ by generalized cross validation as described in Section 3.1. We also define an optimal (though unavailable) $\lambda_{test}$ by minimizing the test set prediction error:

[TABLE]

Figure 5 shows the prediction risk of the GCV-selected estimates relative to the oracle. That is, we plot the ratio

[TABLE]

All methods are within 1% of the best test error the majority of the time, but the convex combination and partial compression tend do the best. Note that the minimum test error is for the particular method rather than relative to the best possible test error across all methods. Thus, while GCV selects the optimal tuning parameter for partial compression more accurately than for the linear combination, the linear combination has lower estimation error at its own GCV-selected tuning parameter.

While Section 3 presented two methods for estimating the degrees of freedom—a simple approximation as in equation (19) or the divergence—we only present the results for the approximation. As discussed in Section 3, using the divergence-based, degrees-of-freedom estimator requires generating an $n\times n$ matrix, which is computationally infeasible.

4.4 Overall performance assessments

Using the tuning parameter selected by GCV, compressed regression can perform better than uncompressed regression in some situations. Figure 6 examines the performance across all simulations. Specifically, for each of the 50 training data sets, we estimate the linear model with each method, choosing $\lambda$ by GCV if appropriate. We select the method with the lowest estimation error. We plot the proportion of times each method “wins” across all simulation conditions. Generating $\beta_{*}$ from a normal distribution favors ridge regression, as is to be expected, since this is the optimal full-data estimator. On the opposite extreme, when $(\beta_{*})_{j}\equiv 1$ , the unconstrained linear combination of compressed estimators is nearly always best. In the middle, each of the methods works some of the time. Overall, the convex combination estimator, that is the estimator given by (13), works well in most cases and has smaller variance than the unconstrained linear combination. Ordinary least squares almost never wins.

4.5 Recommendations

Even though ridge regression is optimal when $\beta_{*}$ has a Gaussian distribution, regularized compression can achieve nearly the same prediction error while using fewer computations and less storage space. Furthermore, the compressed estimators nearly always outperform ordinary least squares. Figure 7 displays the prediction risk for each method when the tuning parameter is chosen by GCV. The difference between the optimal model and the compressed approximations is negligible.

5 Real data examples

We present results for two different types of real data: a collection of genetics data and an astronomy dataset.

5.1 Genetics

The first data we examine are a collection of short-read RNA sequences. The data are publicly available111From Jun Li: http://www3.nd.edu/~jli9/mseq/data_top100.zip and were first examined by Li et al. (2010), who suggest a Poisson linear model to predict read counts based on the surrounding nucleotides. An implementation of this method is provided in the R package mseq, described by Li et al. (2010) and available from the CRAN archive.

In all, there are eight data files from three research groups. Three datasets are due to Mortazavi et al. (2008), which mapped mouse transcriptomes from brain, liver, and skeletal muscle tissues. Wang et al. (2008) collected data from 15 different human tissues which have been merged into three groups based on tissue similarities. Finally, Cloonan et al. (2008) examined RNA sequences from mouse embryonic stem cells and embryoid bodies. In all cases, we use the top 100 highly-expressed genes as well as the surrounding sequences to predict expression counts as in (Li et al., 2010). Table 1 shows the sample size $n$ for each of the eight data sets.

Following Li et al. (2010) and Dalpiaz et al. (2013), we examine each of these datasets separately. In order to build the model, we must select how many surrounding nucleotides to use for prediction. As nucleotides are factors (taking levels C,T,A,G), a window of $k$ surrounding nucleotides will give $p=3(k+1)+1$ predictors of which one is the intercept. We could also use dinucleotide pairs (or higher interactions), as in Li et al. (2010), resulting in $p=15(k+1)+1$ predictors. For our illustration, we follow Ma et al. (2015), who also apply different compressed linear regression methods to these data, and use $k=39$ .

The results of our analysis are shown in Figure 8. For each dataset, which we denote by the first letter of the senior author’s last name (W, B, and G respectively) followed by a number, we split the data randomly into 75% training data and 25% testing data. We then compress the training set using $q=10000$ and $q=20000$ . We apply each of the regularized compressed methods, choosing $\lambda$ by generalized cross validation and then evaluate the estimators by making predictions on the test set. We repeat this procedure 10 times and present the average of the log test error relative to OLS. Across data sets, $q=10000$ results in data reductions between 74% and 93% (meaning $0.26\geq q/n\geq 0.07$ ) while $q=20000$ gives reductions between 48% and 84%.

For these data, ridge and OLS give equivalent test set performance (differing by less than .001%) across all data sets. The similarity between these two estimators suggests that compression will be hard-pressed to yield improvements since the signal-to-noise ratio is so high. Additionally, previous analyses have found that the coefficients are roughly centered around zero with most quite small. While none of our methods are able to beat OLS, their performance is not much worse. The worst method is always full compression. The linear combination and the convex combination are nearly equivalent, while partial compression is just slightly worse. Even for full compression, its worst performance across all data sets and over both values of $q$ is less than 1.5% worse than OLS. So even for the worst performing method, large amounts of compression result in a negligible increase in test set error.

5.2 Galaxies

To create another data set with larger coefficients possibly not centered around zero, we examine a collection of galaxies and attempt to predict their redshifts, a measurement of how far the galaxy is from earth, from their flux measurements, the intensity of light at different wavelengths. We downloaded $n=5000$ random galaxies from the Sloan Digital Sky Survey 12 (Alam et al., 2015). Following Richards et al. (2009), we restricted our sample to contain only galaxies with estimated redshift $z>0.05$ and plate number less than 7000 (currently, larger plate numbers may contain galaxies without observed flux measurements). We also removed galaxies which have been flagged as having unreliable redshift estimates. While 5000 galaxies is quite a bit smaller than the genetics data we used above, we note that this is just a tiny subsample of the nearly 1.5 million currently available. Each galaxy has 3693 flux measurements measured at wavelengths between 3650Å and 10400Å, but following previous analyses, we truncate those below 4376Å and above 10247Å. Finally, because the measurements are quite noisy, we smooth the flux measurements using a regression spline with 125 equally spaced knots on the log scale. We use the 125 spline coefficients from the smoothed versions for all 5000 galaxies as our design matrix.

As above, we randomly divide the data into 75% training data and 25% testing data, estimate each method on the training set, and predict the test data. We repeat the experiment 10 times using $q=1000$ and $q=2000$ . The results are shown in Figure 9.

In this case, ridge regression has better performance than ordinary least squares on every replication, improving test error between 5 and 10%. None of the compressed methods do quite this well. When $q=1000$ , the convex combination and linear combination are generally between 3 and 8% worse than OLS. Interestingly, in this case, full compression is not much different, but partial compression is significantly worse with test error between about 16% and 21% worse than OLS. With $q=2000$ , the convex combination is again the best, with about half of the replications outperforming OLS, while the linear combination is not far behind. The convex combination had better test error than ridge regression in one of the ten replications.

6 Theoretical analysis

To develop a better understanding of the relationship between the compressed regression methods proposed here and standard full-data techniques, we derive expressions for the expectation and variance of full and partial compression estimators as well as their bias and variance. For comparison, we first present the standard analogues for ridge regression.

6.1 Standard results for ridge regression

Write the singular value decomposition of $\mathbb{X}=UDV^{\top}$ . Then define the ridge regression estimator of $\beta_{*}$ as in equation (7).

The bias and variance of the ridge regression estimator conditional on the design matrix are given in the first result.

Lemma 1.

[TABLE]

A standard corollary gives the squared bias and trace of the variance of the ridge regression estimator conditional on the design.

Corollary 2.

[TABLE]

In the next sections, we will derive approximations to these quantities for the fully and partially compressed ridge regression estimators of $\beta_{*}$ .

6.2 Mean and variance of the compressed estimators

Because all of our estimators depend on $(\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}+\lambda I_{p})^{-1}$ , a generally intractable quantity, we derive approximate results via a first order Taylor expansion of the estimator with respect to the matrix $Q^{\top}Q$ . The proofs as well as intermediary results are included in the Supplementary Material.

Following Ma et al. (2015), we use the Taylor expansion of $\widehat{\beta}$ as a function of $A:=\frac{s}{q}Q^{\top}Q$ around $I_{n}$ to derive results conditional on $Y$ and $\mathbb{X}$ (taking expectations over $Q$ ) as well as results unconditional on $Y$ . The first case reflects the randomness in the compression algorithm relative to the more computationally demanding ridge regression. The second is useful for comparing the compressed procedures with ridge regression by including randomness introduced through the data generating process and through the compression algorithm. In all cases, these results are conditional on the design matrix as was the case above. For convenience of expression, define

[TABLE]

Finally, for these results we will assume that $q=cn$ for some $0<c\leq 1$ which is fixed. We discuss this assumption further in the remark below.

Theorem 3.

For full compression,

[TABLE]

where $\widehat{e}=(I-H)Y$ . Furthermore,

[TABLE]

where $\widehat{Y}=HY$ .

Theorem 4.

For partial compression,

[TABLE]

where $\widehat{Y}=HY$ . Additionally,

[TABLE]

Remark 3.

In each expression above, the Taylor series is valid whenever higher-order terms are small. In our case, the higher-order terms are $o_{P}(\left|\left|A-I\right|\right|^{2})$ under the expansion, for some matrix norm $\left|\left|\cdot\right|\right|$ . Here $o_{P}(\cdot)$ is with respect to the randomness in $A$ through $Q$ . As $Q$ is independent of the data, one can examine how large these deviations are likely to be. In particular, using results similar to the Tracy-Widom law (Rudelson and Vershynin, 2010, Proposition 2.4), one can show that $\left|\left|A-I\right|\right|=O_{P}(\sqrt{n/q})$ . Therefore, taking $q=cn$ for some $0<c\leq 1$ means that the remainder is $o_{P}(1)$ . This is in contrast with results of Ma et al. (2015), for two reasons: (1) their sampling mechanism is allowed to depend on the data where ours is not and (2) they can lose rank from the compression. In our case, $(\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}+\lambda I)$ is full rank for all $\lambda>0$ regardless of the rank of $\mathbb{X}^{\top}Q^{\top}Q\mathbb{X}$ , so the Taylor series is always valid.

On average, both procedures are the same as ridge regression, up to higher order remainder terms. The variance, however, is larger for the same value of $\lambda$ . Finally, we note that these results make explicit another tradeoff between computation and estimation: taking $s\leq 2$ eliminates one term in the variance expansion completely at the expense of denser $Q$ .

In order to compare our procedure to ridge regression more directly, we examine the Taylor expansions of the mean squared error (rather than expanding the estimator and then taking expectations) in the next section.

6.3 Mean squared error performance

Rather than looking at the estimated coefficients, we could also look at the mean squared error (MSE) directly. If we ignore remainder terms, we can try to minimize the MSE over $\lambda$ and evaluate the quality of the resulting oracle estimator. As suggested in our simulations, there exists a $\lambda_{*}$ such that compressed ridge regression actually achieves lower MSE than ridge regression does.

Theorem 5.

The squared-bias of the fully compressed estimator is:

[TABLE]

The squared-bias of the partially compressed estimator is:

[TABLE]

Note that, ignoring the remainder, the squared bias of both fully compressed and partially compressed estimators when averaged over the compression is the same as that of ridge regression.

Theorem 6.

The variance of the fully compressed estimator is:

[TABLE]

The variance of the partially compressed estimator is:

[TABLE]

Corollary 7.

Suppose $\mathbb{X}$ is such that $\mathbb{X}^{\top}\mathbb{X}=nI_{p}$ , $b^{2}:=\left|\left|\beta_{*}\right|\right|_{2}^{2}$ , and $\theta:=\lambda/n$ . Then

[TABLE]

Hence, for ridge, the optimal $\theta_{*}=\sigma^{2}p/(nb^{2})$ and $\lambda_{*}=\sigma^{2}p/b^{2}$ . For the other methods, the MSE can be minimized numerically, but we have, so far, been unable to find an analytic expression.

7 Conclusion

In this paper, we propose and explore a broad family of compressed, regularized linear model estimators created by generalizing a commonly used approximation procedure which we notate $\widehat{\beta}_{FC}$ (defined in equation 3). We show that $\widehat{\beta}_{FC}$ must indeed perform worse than the least squares solution. We suggest using $\widehat{\beta}_{\alpha}(\lambda)$ , defined in equation (12) instead. As $\widehat{\beta}_{\alpha}(\lambda)$ has a tuning parameter $\lambda$ , we discuss methods for choosing it in a data dependent and computationally feasible way.

We find that, in particular cases, $\widehat{\beta}_{\alpha}(\lambda)$ can perform better than full-data OLS and full-data ridge regression, and that this statistical performance gain is accompanied by a computational one. Admittedly, $\widehat{\beta}_{\alpha}(\lambda)$ does not perform as well in our real data examples, however, it appears that these data sets have relatively low variance, and hence, including bias does not improve performance as much as if there were larger variance.

Interesting future work would examine the divergence based tuning parameter selection method. This approach, while promising, seemingly requires the computation of the dense, large matrix $Q^{\top}Q$ , and hence, is computationally infeasible. Additionally, other forms for the compression matrix $Q$ should be investigated to examine their statistical impact. Finally, while ridge penalties are amenable to theoretical analysis because of their closed form solution, it is worthwhile to examine other penalty functions and other losses, such as generalized linear models.

Appendix A Supplementary material

A.1 Background results

Lemma 8.

[TABLE]

where $e_{j}$ is an $n^{2}$ vector with a 1 in the $j^{th}$ position and zeros otherwise.

Lemma 9.

[TABLE]

where $E_{jj}$ is the $n\times n$ matrix with a 1 in the $jj$ entry and zeros otherwise.

Proof.

[TABLE]

∎

Corollary 10.

If $b$ is a vector, then

[TABLE]

Lemma 11.

Let $\widehat{e}=(I_{n}-H)Y$ for $H=\mathbb{X}(\mathbb{X}^{\top}\mathbb{X}+\lambda I_{p})\mathbb{X}^{\top}=UD(D^{2}+\lambda I_{p})^{-1}DU^{\top}$ . Then,

[TABLE]

Lemma 12.

Let $\widehat{Y}=HY$ . Then,

[TABLE]

Appendix B Proofs

Lemma 13.

For $q_{ij}$ distributed independently as sparse Bernoulli random variables, we have

[TABLE]

where $K_{nn}$ is the $(nn,nn)$ commutation matrix: $K_{nn}$ is the unique matrix such that, for any $n\times n$ matrix $A$ , $\operatorname{vec}\left(A^{\top}\right)=K_{nn}\operatorname{vec}\left(A\right)$ (see Magnus and Neudecker, 1979).

Proof of Appendix B.

We have

[TABLE]

Now, for the variance, note that each element of $\operatorname{vec}\left(Q^{\top}Q\right)$ is equal to $q_{i}^{\top}q_{j}$ for appropriate $i,\ j\in\{1,\ldots,n\}$ . Therefore, the $(ij,kl)$ -element of the $(n^{2}\times n^{2})$ -variance matrix is given by $\mathbb{E}[q_{i}^{\top}q_{j}q_{\ell}^{\top}q_{m}]-(\mathbb{E}[Q^{\top}Q])_{i,j}(\mathbb{E}[Q^{\top}Q])_{\ell,m}.$

[TABLE]

Properties of the commutation matrix give the result. ∎

Lemma 14.

For $A$ fixed, the first order Taylor expansion of the fully and partially compressed regression estimators are

[TABLE]

where $\widehat{e}=(I_{n}-H)Y$ .

Proof of Appendix B.

We begin with the fully compressed estimator.

[TABLE]

where $\operatorname{vec}\left(A\right)=\begin{bmatrix}a_{11}&a_{21}&\cdots&a_{nn}\end{bmatrix}^{\top}$ . Then,

[TABLE]

Here we need two results:

[TABLE]

See (Harville, 1997, (16.6.8) and (16.6.15) respectively). Therefore,

[TABLE]

and

[TABLE]

Therefore,

[TABLE]

Finally,

[TABLE]

where the third equality follows from the result $(B^{\top}\otimes A)\operatorname{vec}\left(X\right)=\operatorname{vec}\left(AXB\right)$ .

The result for the partially compressed estimator proceeds similarly.

[TABLE]

Then, as above,

[TABLE]

Finally,

[TABLE]

∎

Proof of Theorem 3.

As $\mathbb{E}_{Q}[A]=I$ , the conditional expectation is straightforward. For the conditional variance,

[TABLE]

For the unconditional case, taking the expectation of the conditional expectation with respect to $Y$ gives the first result. The second follows from the formula for total variance and Section A.1. ∎

Proof of Theorem 4.

Borrowing the notation from the previous result,

[TABLE]

As before, taking the expectation of the conditional expectation with respect to $Y$ gives the first result. The second follows from the formula for total variance and Section A.1. ∎

The following two lemmas follow from standard properties of expectation and variance of linear estimators, and their proofs are omitted.

Lemma 15.

The expectation and variance of the fully compressed regression estimator, conditional on $Q$ and the design, are

[TABLE]

Lemma 16.

The expectation and variance of the partially compressed regression estimator, conditional on $Q$ and the design, are

[TABLE]

We now use these lemmas to derive the following theorems for full and partial compression.

Proof of Theorem 5.

The squared bias for each estimator is

[TABLE]

Note that this is a function of $A$ , and the expectation is over $y$ (and therefore $\epsilon$ ). We suppress the conditioning of all expectations on $\mathbb{X}$ and $Q$ . Taking a Taylor expansion to first order,

[TABLE]

We have that

[TABLE]

For the fully pre-conditioned estimator,

[TABLE]

Proceeding as in the proof of Appendix B,

[TABLE]

and

[TABLE]

Therefore,

[TABLE]

Evaluating

[TABLE]

at $A=I_{n}$ and simplifying gives

[TABLE]

We can simplify this expression using the following result from (Harville, 1997, 16.2.15):

[TABLE]

Therefore,

[TABLE]

Taking the expectation with respect to $Q$ gives the second result.

For the case of partial pre-conditioning,

[TABLE]

Proceeding as above and evaluating at $A=I_{n}$ ,

[TABLE]

∎

Proof of Theorem 6.

As for the trace of the variance, the Taylor expansion is given by

[TABLE]

For any matrix $B$ , we have $\operatorname{tr}(B^{\top}B)=\operatorname{vec}\left(B\right)^{\top}\operatorname{vec}\left(B\right)$ . Therefore,

[TABLE]

which both have this form. We begin with fully compressed.

[TABLE]

Now, we have

[TABLE]

where

[TABLE]

Combining these gives

[TABLE]

Evaluating at $A=I_{n}$ and simplifying gives

[TABLE]

For the case of partial compression, we only need

[TABLE]

Evaluating at $A=I_{n}$ and simplifying gives

[TABLE]

For the variances unconditional on $Q$ , note that by the law of total variance,

[TABLE]

By the conditional part of this theorem. Now, we first expand $\mathbb{E}[\widehat{\beta}\mbox{ }|\mbox{ }\mathbb{X},Q]$ , relying on previous results:

[TABLE]

As we will be taking the variance with respect to $A$ , the first term is constant. For the second term we have,

[TABLE]

Therefore,

[TABLE]

In both cases, we need $\operatorname{tr}(\mathbb{V}[LAz])$ for a matrix $L$ and vector $z$ . As this is a vector, we have that $LAz=(z^{\top}\otimes L)\operatorname{vec}\left(A\right)$ , so

[TABLE]

Now applying the trace,

[TABLE]

Therefore,

[TABLE]

Using the SVD of $\mathbb{X}$ gives,

[TABLE]

Therefore,

[TABLE]

∎

Proof of Section 6.3.

For full compression we have,

[TABLE]

For partial compression we have,

[TABLE]

Simplifying and substituting $\theta=\lambda/n$ gives the result. ∎

Bibliography35

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Achlioptas (2003) Achlioptas, D. (2003), ‘Database-friendly random projections: Johnson-Lindenstrauss with binary coins’, Journal of Computer and System Sciences 66 (4), 671–687.
3Alam et al. (2015) Alam, S., Albareti, F. D., Prieto, C. A., Anders, F., Anderson, S. F., Anderton, T., Andrews, B. H., Armengaud, E., Aubourg, E., Bailey, S., Basu, S. et al. (2015), ‘The eleventh and twelfth data releases of the Sloan Digital Sky Survey: Final data from SDSS-III’, The Astrophysical Journal Supplement Series 219 (1), 12.
4Avron et al. (2010) Avron, H., Maymounkov, P. and Toledo, S. (2010), ‘Blendenpik: Supercharging lapack’s least-squares solver’, SIAM Journal on Scientific Computing 32 (3), 1217–1236.
5Becker et al. (2017) Becker, S., Kawas, B., Petrik, M. and Ramamurthy, K. N. (2017), Robust partially-compressed least-squares, in ‘The Thirty-First AAAI Conference on Artificial Intelligence’.
6Cloonan et al. (2008) Cloonan, N., Forrest, A. R. R., Kolle, G., Gardiner, B. B. A., Faulkner, G. J., Brown, M. K., Taylor, D. F., Steptoe, A. L., Wani, S., Bethel, G., Robertson, A. J., Perkins, A. C., Bruce, S. J., Lee, C. C., Ranade, S. S., Peckham, H. E., Manning, J. M., Mc Kernan, K. J. and Grimmond, S. M. (2008), ‘Stem cell transcriptome profiling via massive-scale m RNA sequencing’, Nature Methods 5 (7), 613–619.
7Dalpiaz et al. (2013) Dalpiaz, D., He, X. and Ma, P. (2013), ‘Bias correction in RNA-Seq short-read counts using penalized regression’, Statistics in Biosciences 5 (1), 88–99.
8Dasgupta et al. (2010) Dasgupta, A., Kumar, R. and Sarlós, T. (2010), A sparse Johnson-Lindenstrauss transform, in ‘Proceedings of the 42nd ACM Symposium on Theory of Computing’, ACM, pp. 341–350.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Compressed and Penalized Linear Regression

Abstract

1 Introduction

1.1 Overview of the problem

1.2 Prior work

1.3 Our contribution

2 Compressed regression

2.1 Compressed least squares regression

2.2 Compressed ridge regression

2.3 Linear combination compressed ridge regression

2.4 Compression matrices

3 Tuning Parameter Selection

3.1 Estimating the degrees of freedom

3.2 Computing the path

4 Simulations

4.1 Setup

Remark 1**.**

4.2 Estimation error simulations

Remark 2**.**

4.3 Selecting tuning parameters

4.4 Overall performance assessments

4.5 Recommendations

5 Real data examples

5.1 Genetics

5.2 Galaxies

6 Theoretical analysis

6.1 Standard results for ridge regression

Lemma 1**.**

Corollary 2**.**

6.2 Mean and variance of the compressed estimators

Theorem 3**.**

Theorem 4**.**

Remark 3**.**

6.3 Mean squared error performance

Theorem 5**.**

Theorem 6**.**

Corollary 7**.**

7 Conclusion

Appendix A Supplementary material

A.1 Background results

Lemma 8**.**

Lemma 9**.**

Proof.

Corollary 10**.**

Lemma 11**.**

Lemma 12**.**

Appendix B Proofs

Lemma 13**.**

Proof of Appendix B.

Lemma 14**.**

Proof of Appendix B.

Proof of Theorem 3.

Proof of Theorem 4.

Lemma 15**.**

Lemma 16**.**

Proof of Theorem 5.

Proof of Theorem 6.

Proof of Section 6.3.

Remark 1.

Remark 2.

Lemma 1.

Corollary 2.

Theorem 3.

Theorem 4.

Remark 3.

Theorem 5.

Theorem 6.

Corollary 7.

Lemma 8.

Lemma 9.

Corollary 10.

Lemma 11.

Lemma 12.

Lemma 13.

Lemma 14.

Lemma 15.

Lemma 16.