Quickly Finding the Best Linear Model in High Dimensions
Yahya Sattar, Samet Oymak

TL;DR
This paper introduces a projected gradient descent algorithm for efficiently finding the optimal linear model in high-dimensional settings, with theoretical guarantees and practical validation.
Contribution
It presents a novel PGD method with convergence and error bounds applicable to heavy-tailed distributions, without assuming realizability, and includes bias learning augmentation.
Findings
Linear convergence rate established for PGD.
Effective in heavy-tailed sub-exponential distributions.
Numerical experiments confirm theoretical predictions.
Taxonomy
Topics: Sparse and Compressive Sensing Techniques · Statistical Methods and Inference · Stochastic Gradient Optimization Techniques
Yahya Sattar, Samet Oymak. Department of Electrical and Computer Engineering, University of California, Riverside, CA 92521, USA.
Abstract
We study the problem of finding the best linear model that minimizes the least-squares loss given a dataset. While this problem is trivial in the low-dimensional regime, it becomes more interesting in high dimensions, where the population minimizer is assumed to lie on a manifold such as sparse vectors. We propose a projected gradient descent (PGD) algorithm to estimate the population minimizer in the finite-sample regime. We establish a linear convergence rate and data-dependent estimation error bounds for PGD. Our contributions include: 1) The results are established for heavier-tailed sub-exponential distributions in addition to sub-gaussian ones. 2) We directly analyze the empirical risk minimization and do not require a realizable model that connects input data and labels. 3) Our PGD algorithm is augmented to learn the bias terms, which boosts the performance. The numerical experiments validate our theoretical results.
Index Terms:
high-dimensional estimation, projected gradient descent, one-bit compressed sensing, Gaussian width.
I Introduction
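Supervised learning is concerned with finding a relation between the input-output pairs $(\bm{x}_{i},y_{i})_{i=1}^{n}$. The simplest relations are linear functions, where the output is estimated by a linear function of the input, that is, $\hat{y}=\bm{x}^{T}\bm{\theta}$. Using quadratic loss, we can find the optimal $\bm{\theta}$ with a simple linear regression, which minimizes
$$\mathcal{L}(\bm{\theta})=\frac{1}{2}\sum_{i=1}^{n}\big(y_{i}-\bm{x}_{i}^{T}\bm{\theta}\big)^{2}.$$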
If the samples are i.i.d. and the input has identity covariance, the population minimizer ($n\rightarrow\infty$) is simply given by
$$\bm{\theta}^{\star}=\mathbb{E}[y\,\bm{x}],$$
where is drawn from same distribution as data. In many applications, we operate in the high-dimensional regime where we have fewer samples than the parameter dimension i.e. . In this case, the problem is ill-posed; however, if lies on a low-dimensional manifold, we can take advantage of this information to solve the problem. We assume is structured-sparse, for instance, it can be a signal that is sparse in a dictionary or it can be a low-rank matrix. If is a regularization function that promotes this structure, we can solve the regularized empirical risk minimization (ERM)
[TABLE]
where are the output labels and data matrix respectively. This problem is well-studied in the statistics and compressed sensing (CS) literature. However, much of the theory literature is concerned with the scenario where the problem is realizable i.e. the outputs are explicitly generated with respect to some ground truth vector . In the simplest scenario, input/output relation can be where is independent zero-mean noise vector. In this case, one simply has . Such realizability assumption is also common in the single-index models [1, 2]. One contribution of this paper will be analyzing regularized ERM without the realizability assumption.
Bias in the data can negatively affect the estimation quality. Assuming the input is zero-mean, instead of solving (1) we can solve a modified problem that also accounts for the mean of the output. Again denoting the regularization function by $\mathcal{R}$, we will solve the modified problem
$$\min_{\bm{\theta},\,\mu}~\mathcal{L}(\bm{\theta},\mu)\quad\text{subject to}\quad\mathcal{R}(\bm{\theta})\leq R,\tag{2}$$
where the loss is given by $\mathcal{L}(\bm{\theta},\mu)=\frac{1}{2}\big\|\bm{y}-[\bm{X}~\mathbf{1}]\begin{bmatrix}\bm{\theta}\\ \mu\end{bmatrix}\big\|_{\ell_{2}}^{2}$. We will show that solving problem (2) is essentially equivalent to solving (1) with debiased output; hence it will result in more accurate estimation. The goal of this paper is to study problem (2) under a general algorithmic framework, establishing finite-sample statistical and algorithmic convergence, and addressing practical considerations on the data distribution. In particular, we are interested in how well one can estimate the best linear model (BLM) given by the pair . For estimation, we will utilize the projected gradient descent algorithm given by the iterates
[TABLE]
where $\mathcal{P}_{\mathcal{K}}$ projects onto the constraint set $\mathcal{K}=\{\bm{\theta}\in\mathbb{R}^{p}~\big|~\mathcal{R}(\bm{\theta})\leq R\}$ and $\eta$ is the step size.
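The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `pgd_with_bias` and `hard_threshold`, the zero initialization, and the iteration count are assumptions made for the example. The gradient is taken with respect to the loss $\frac{1}{2}\|\bm{y}-\bm{X}\bm{\theta}-\mu\mathbf{1}\|_{\ell_2}^2$, only $\bm{\theta}$ is projected, and hard thresholding is used as one possible projection (onto sparse vectors).

```python
import numpy as np

def hard_threshold(theta, s):
    """Project onto s-sparse vectors: keep the s largest-magnitude entries."""
    out = np.zeros_like(theta)
    if s > 0:
        keep = np.argpartition(np.abs(theta), -s)[-s:]
        out[keep] = theta[keep]
    return out

def pgd_with_bias(X, y, project, eta, n_iters=100):
    """Projected gradient descent on 0.5 * ||y - X theta - mu * 1||^2.

    `project` maps a vector in R^p to the constraint set; the bias mu is
    updated by a plain gradient step and is never projected.
    """
    n, p = X.shape
    theta = np.zeros(p)   # assumed initialization
    mu = 0.0
    for _ in range(n_iters):
        residual = y - X @ theta - mu
        # The negative gradient w.r.t. theta is X^T residual.
        theta = project(theta + eta * (X.T @ residual))
        # The negative gradient w.r.t. mu is the sum of the residuals.
        mu = mu + eta * residual.sum()
    return theta, mu
```

With whitened inputs, a step size on the order of $1/n$ is a natural starting point because $\bm{X}^{T}\bm{X}$ then has eigenvalues on the order of $n$; the admissible range depends on the data.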
I-A Relation to Prior Work
There is a significant amount of literature on nonlinear (or one-bit) CS [3, 4, 5, 6, 7, 2, 8, 9, 10, 11, 12]. The works [13, 4, 14, 15, 16] study algorithmic and statistical convergence rates for first-order methods such as projected/proximal gradient descent. For nonlinear CS, [17, 4, 7, 5] provide statistical analysis of single-index estimation with a focus on Gaussian data. Recently, one-bit CS techniques have been extended to sub-gaussian distributions using the dithering trick, which adds noise before quantization [18, 19, 20, 21]. Dithering is introduced to guarantee consistent estimation of the ground-truth parameter. The papers [22, 23, 24, 25, 26] address non-gaussianity by utilizing Stein's identity, which requires access to the distribution of the input samples. Closer to our work, [27] studies constrained empirical risk minimization with linear functions and squared loss, with a focus on convex problems. In comparison, our analysis applies to a broader class of distributions and focuses on first-order algorithms. Much of our analysis focuses on addressing subexponential samples, which requires tools from high-dimensional probability [28, 29].
Our results apply to general regularizers and borrow ideas from [4, 5, 6, 7]. Similar to these, we view the nonlinearity between input and output as an additive noise. The convergence analysis of projected gradient descent is a rather well-understood topic and we utilize insights from [15, 14, 13, 16] for our analysis.
I-B Contributions
At a high level, our work has three distinguishing features compared to the prior literature.
Subexponential samples: Most nonlinear CS results apply to Gaussian data, or to subgaussian data when the dithering trick is utilized [18, 19, 20, 21]. We take advantage of recent techniques for subexponential distributions to provide statistical/computational guarantees for heavier-tailed distributions.
No realizability assumption: The nonlinear CS literature is typically concerned with a ground-truth vector to be recovered. For instance, one-bit CS aims to learn from samples of type . Unlike these, we do not require such a relationship to exist between input and output; hence the results apply under much weaker assumptions. Instead of a ground-truth , we work with the population BLM . However, can be shown to coincide with the ground truth when it exists, if the input distribution is nice (e.g. Gaussian) [4, 5, 6, 7].
Bias estimation: Our analysis addresses the bias in the output by solving the modified problem (2). We show that (2) can be studied in a similar fashion to (1) by studying the statistical properties of the concatenated data matrix. However, empirically this modification results in a substantial improvement in estimation.
I-C Paper Organization
We review mathematical background and formulate the problem in Section II. We introduce our main results on statistical and computational convergence guarantees in Section III. Section IV provides numerical experiments to corroborate our theoretical results. Proofs of the main results are provided in Section V and finally the concluding remarks are made in Section VI.
II Preliminaries and Problem Formulation
In this section we introduce statistical quantities which are utilized to characterize the benefits of the regularization .
We first set the notation. denote positive absolute constants. For a vector , we denote its Euclidean norm by and its norm by . Similarly for a matrix , we denote its spectral norm by . Given a set $S$, let $\text{cl}(S)$ and $\text{clconv}(S)$ be the minimal closed set and the minimal closed convex set containing $S$, respectively. Let denote the set radius . For closed sets, let be the projection operator defined as . denotes the normal distribution and denotes the unit ball in . is the all-ones vector of proper dimension. We will use and for inequalities that hold up to a constant factor.
Suppose we are given i.i.d. samples . To keep the exposition clean, we assume that is whitened, that is, it has zero-mean and identity covariance. We will aim to find a linear relation between the modified input-output pairs . Let us consider the statistical properties of our modified estimate in the population limit which is given by
[TABLE]
Thus, in the limiting case, captures the mean of the output and is the ideal solution of the problem with debiased output. Our goal is estimating the population minimizer ; which minimizes the expected quadratic loss . As discussed in Section I, assuming is structured sparse, we consider a non-asymptotic estimation of via problem (2). To proceed with analysis, set
[TABLE]
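The population-limit statement above can be checked numerically. The sketch below is an illustration under assumptions that are not from the paper: a ReLU link, a small dimension, centered unit-rate exponential inputs (zero mean, unit variance), and a large sample size standing in for the population limit. It compares the least-squares fit on the augmented matrix $[\bm{X}~\mathbf{1}]$ with the empirical output mean and the empirical correlation of the inputs with the debiased output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 5                      # hypothetical sizes; large n mimics the population limit

# Exponential(1) has mean 1 and variance 1, so subtracting 1 whitens each entry.
X = rng.exponential(1.0, size=(n, p)) - 1.0
theta_gt = np.array([1.0, -0.5, 0.0, 0.25, 0.0])   # hypothetical ground truth
y = np.maximum(X @ theta_gt, 0.0)      # ReLU link, so the output mean is nonzero

# Least squares on the augmented data matrix [X 1].
A = np.hstack([X, np.ones((n, 1))])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_hat, mu_hat = sol[:p], sol[p]

# Candidate population quantities, estimated by sample averages.
mu_pop = y.mean()                      # mean of the output
theta_pop = X.T @ (y - mu_pop) / n     # correlation of input with debiased output

print("bias:", mu_hat, "vs output mean:", mu_pop)
print("max |theta_hat - theta_pop|:", np.max(np.abs(theta_hat - theta_pop)))
```

Because the inputs are zero-mean with identity covariance, the two pairs agree up to finite-sample fluctuations.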
We investigate the PGD algorithm (3) which can be written as
[TABLE]
where is a fixed learning rate and is the modified data matrix constructed as follows
[TABLE]
Following [4, 30] PGD analysis can be related to the tangent ball around the population parameter which is given by
[TABLE]
Similarly, we define the extended tangent ball as follows
[TABLE]
The two definitions above are closely related. For any vector , we have that for . In the following we will express the convergence rates and residual errors of the PGD algorithm (3) in terms of the statistical properties of the tangent balls .
Technical approach: Denoting the parameter estimation error in (6) by and the effective noise by , the PGD update can be shown to obey [14] (see Eq. (VI.10))
[TABLE]
where is a numerical constant which is equal to for convex regularizer and for arbitrary and
[TABLE]
Here captures the algorithmic convergence and captures the statistical accuracy in terms of regularization. To achieve statistical learning bounds, we need to characterize the quantities above in finite sample. Existing literature provides a fairly good understanding of the related terms when has subgaussian rows or is independent of . The technical contributions of this work are i) extending these results to subexponential samples, ii) allowing for nonlinear dependencies between the noise and data, and iii) addressing the bias term by studying the concatenated matrix . To proceed with statistical analysis, we introduce Gaussian width.
Definition II.1 ((Perturbed) Gaussian width [29])
The Gaussian width of a set is defined as
[TABLE]
Let be an absolute constant. Given an integer , the perturbed Gaussian width of is defined as
[TABLE]
where is Talagrand’s -functional (see [28]) with -metric.
The Gaussian width quantifies the complexity of the regularized problem and determines the sample complexity of linear inverse problems, i.e., high-dimensional problems become manageable in the regime [31, 30]. The perturbed width was introduced more recently in [29] to address subexponential samples. In particular, [29] shows that, for standard regularizers such as , subspace, and rank constraints, we have that
[TABLE]
in the interesting regime . Hence, the perturbed width yields the same statistical accuracy as the Gaussian width but applies to subexponential samples.
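To make the Gaussian width concrete, the following sketch estimates it by Monte Carlo for one simple set: the $s$-sparse vectors in the unit Euclidean ball. This set is chosen for illustration only and is not the tangent ball used in the analysis. For this set the supremum of $\bm{g}^{T}\bm{v}$ has a closed form, namely the Euclidean norm of the $s$ largest-magnitude entries of $\bm{g}$. The function name and trial count are assumptions.

```python
import numpy as np

def gaussian_width_sparse(p, s, n_trials=200, seed=0):
    """Monte Carlo estimate of E sup_{v in T} <g, v> for
    T = {v in R^p : v is s-sparse, ||v||_2 <= 1}, with 1 <= s <= p."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_trials, p))
    # After partitioning, the last s columns of each row hold its s largest |g_i|.
    top = np.partition(np.abs(g), p - s, axis=1)[:, p - s:]
    # The supremum is attained by aligning v with those s entries.
    return np.linalg.norm(top, axis=1).mean()

if __name__ == "__main__":
    p = 1000
    for s in (5, 20, 100):
        w = gaussian_width_sparse(p, s)
        # s * log(p / s) is the familiar scaling for sparsity, shown for comparison only.
        print(s, round(w**2, 1), round(s * np.log(p / s), 1))
```

The squared estimate stays far below the ambient dimension $p$ when $s$ is small, which is the sense in which the width measures effective degrees of freedom.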
As illustrated in Table I, the square of the Gaussian width captures the degrees of freedom for practical regularizers. Table I is obtained by setting in (4). In practice, a good choice for can be found by using cross-validation. It is also known that the performance of PGD is robust to the choice of (see Thm 2.6 of [14]).
[TABLE]
The next statistical quantity required in our analysis is the Orlicz norm, defined as follows.
Definition II.2 (Orlicz norms)
For a scalar random variable , the Orlicz-$a$ norm is defined as
[TABLE]
The Orlicz-$a$ norm of a vector is defined as . Subexponential and subgaussian norms are special cases of the Orlicz-$a$ norm given by and respectively.
Based on perturbed Gaussian width definition, we will show that one can upper bound the critical quantities (10) and (11). In return, this will reveal the statistical and computational performance of the PGD algorithm. This is the topic of the next section which states our main results.
III Main Results
In this section we estimate the convergence rate and the statistical accuracy of the PGD algorithm as a function of sample size, complexity of the parameter (e.g. sparsity level), and the distribution of the data (whether subgaussian or subexponential). Our main theorem establishes a linear convergence rate of PGD and shows that PGD achieves statistically efficient error rates. We first describe the data model.
Definition III.1 (Isotropic vector)
A vector is called an isotropic Orlicz-$a$ vector if it is zero-mean with identity covariance and if its Orlicz-$a$ norm is bounded by an absolute constant.
Definition III.2 (-noisy datasets)
We assume the samples . We call a dataset -Orlicz- if the input samples are isotropic Orlicz-$a$ vectors and the residual at the ground truth obeys
[TABLE]
We call -Orlicz- dataset -subexponential and -Orlicz- dataset -subgaussian.
Note that the residual at the ground truth is the noise in our problem, which may be a function of the nonlinearity. Our main results capture the PGD performance for different dataset models.
Theorem III.3 (Subgaussian)
Suppose is a -subgaussian dataset. Assume and set learning rate . Let be an arbitrary regularizer. Starting from any initial estimate , with probability at least , all PGD iterates (6) obey
[TABLE]
Similarly, for subexponential samples, we have the following theorem which applies to convex regularizers.
Theorem III.4 (Subexponential)
Suppose is a -subexponential dataset. Set . Set learning rate , suppose is convex and . Starting from initialization , with probability at least , all PGD iterates (6) obey
[TABLE]
Both of these results show that PGD iterates converge to the population parameters at a linear rate. The subexponential theorem requires a more conservative choice of learning rate. The statistical estimation error grows as for subgaussian and for subexponential samples. Since our results apply in the regime , following (12), the statistical errors associated with subgaussian and subexponential samples are the same up to a constant for typical regularizers.
Our main results follow from Theorems III.5 and III.6 which are the topics of the following sections.
III-A Controlling the Convergence Rate of PGD
In this section, we study the convergence rate characterized by the term. The challenges we address are (i) characterizing the restricted singular values of the subexponential data matrices and (ii) addressing the concatenated all ones vector.
Theorem III.5 (Convergence rate)
Suppose is a -subgaussian dataset and is the modified-data matrix, where is a vector of all ones. Let and be the tangent balls as defined in (7) and (8) respectively. Assume . Setting , with probability at least we have
[TABLE]
If the dataset is -subexponential, then setting and assuming , with probability , we have
[TABLE]
Note that the subexponential case requires a smaller learning rate, which results in slower convergence.
III-B Bounding the Error due to Nonlinearity
Next, we provide a bound on the effective noise level , which is crucial for assessing statistical accuracy. This term arises from the nonlinearity and noise associated with the relation between input and output. For example, for single-index models, we have $\mathbb{E}[y~\big|~\bm{x}]=\phi(\bm{x}^{T}\bm{\theta}_{\text{GT}})$ for some link function $\phi$ and ground truth $\bm{\theta}_{\text{GT}}$, and $\phi$ becomes the source of the nonlinearity. Our approach is similar to [4, 5, 27, 6, 7] and treats the nonlinearity as noise. The finite sample noise is captured by the residual vector
[TABLE]
Following term in (11), the contribution of the residual to the estimated parameter is captured by the vector
[TABLE]
Our key observation is that the properties of can be characterized under fairly general assumptions compared to the existing literature, which is mostly restricted to zero-mean subgaussian samples.
Theorem III.6 (Statistical error)
Suppose is a -subgaussian dataset. Let the tangent balls and be as defined in (7) and (8) respectively. Assume . Then, with probability at least , we have
[TABLE]
where is the effective noise given by (11). If is a -subexponential dataset and , with probability at least , we have
[TABLE]
This theorem establishes the crucial finite sample upper bounds on for both subgaussian and subexponential data as a function of Gaussian width of the tangent ball. Combining our bounds on and and utilizing the recursion (9), we can obtain the PGD convergence characteristics and prove the main theorems.
IV Numerical Experiments
In this section, we discuss experiments that corroborate our theoretical results. We consider a standard single-index model where, for some ground truth vector and link function , the input/output relation is given by . We pick to be a sparse vector with nonzeros and and set the sample size to be . Because of the sparsity prior, we run PGD as iterative hard thresholding, where is projected to be -sparse after every iteration. As link functions, we consider ReLU (i.e. ) and sign functions (maps to ), which are of interest for deep learning and quantization respectively. We generate ’s with i.i.d. exponentially distributed entries (with parameter ) and then remove the mean and normalize the covariance to identity. We pick a learning rate of in all experiments. The shaded areas in the plots correspond to one standard deviation.
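The data-generation step described above can be sketched as follows. Every concrete value here is a hypothetical placeholder rather than the paper's setting: the exponential rate, the unit-norm Gaussian nonzeros of the ground truth, and the function name `make_single_index_data` are assumptions for illustration. Whitening uses the known mean and standard deviation of the exponential distribution, both equal to the reciprocal of the rate.

```python
import numpy as np

def make_single_index_data(n, p, s, link, rate=1.0, seed=0):
    """Draw whitened exponential inputs and labels from a single-index model.

    link is "relu" or "sign"; rate is the (assumed) exponential parameter.
    """
    rng = np.random.default_rng(seed)
    # Exponential(rate) has mean 1/rate and standard deviation 1/rate.
    X = rng.exponential(scale=1.0 / rate, size=(n, p))
    X = (X - 1.0 / rate) * rate          # zero mean, identity covariance

    # Hypothetical ground truth: s Gaussian nonzeros, scaled to unit norm.
    theta_gt = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    theta_gt[support] = rng.standard_normal(s)
    theta_gt /= np.linalg.norm(theta_gt)

    z = X @ theta_gt
    if link == "relu":
        y = np.maximum(z, 0.0)
    elif link == "sign":
        y = np.sign(z)
    else:
        raise ValueError(f"unknown link: {link}")
    return X, y, theta_gt
```

Whitening with the population moments keeps the rows independent; standardizing with sample moments is an alternative that couples the rows slightly.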
To assess test and training performance of PGD, we use the following three metrics:
- the normalized training error defined as ,
- the normalized test error that is similarly defined but evaluated on a fresh dataset of size using the training model ,
- correlation to ground truth vector defined as .
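One plausible way to compute these metrics is sketched below. The exact normalizations are assumptions made for illustration: the normalized error is taken as the residual energy divided by the label energy, and the correlation as the cosine similarity between the estimate and the ground truth. The function names are hypothetical.

```python
import numpy as np

def normalized_error(X, y, theta, mu=0.0):
    """Assumed form: ||y - X theta - mu||^2 / ||y||^2 (train or test data)."""
    residual = y - X @ theta - mu
    return float(np.sum(residual**2) / np.sum(y**2))

def correlation(theta, theta_gt):
    """Assumed form: cosine similarity, which always lies in [-1, 1]."""
    denom = np.linalg.norm(theta) * np.linalg.norm(theta_gt)
    return float(theta @ theta_gt / denom) if denom > 0 else 0.0
```

The test error reuses `normalized_error` on a fresh dataset with the trained `theta` and `mu`.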
We compare two baselines. The first one runs PGD with and separately. The second one assumes knowledge of the ground truth and fits a model by finding to minimize the training loss. Numerically, we minimize over where . This sets .
Figure 1 plots the loss as a function of the PGD iterations . Both training and test errors decay gracefully with more iterations for both choices of link function. The dashed values correspond to ’s performance. While there is a slight mismatch between train/test performances (due to finite samples), high-dimensional estimation via PGD works well and performs on par with the ground truth. Observe that for ReLU, is nonzero and estimating the mean should be beneficial. Indeed, Figures 1(c) and 1(d) demonstrate that substantially outperforms using alone. There is no improvement for the sign function since .
In Figure 2 we focus on the parameter estimation question by plotting the correlation between and . Correlation is always between and quantifies how well we can estimate the direction of the ground truth vector via PGD. This experiment is conducted with two values of , namely and , while in both cases. Observe that a larger sample size results in more stable estimation (smaller standard deviations) and higher correlation with output. Additionally, Figure 2(d) shows that the ReLU problem achieves better correlation once we account for the bias term. Hence, mean estimation is beneficial not only for test performance but also for parameter estimation.
V Proofs of Main Theorems
This section proves our main results and outlines the proofs of Theorems III.3, III.4, III.5 and III.6. Throughout, we use the same notation as described in Section II.
V-A Proof of Theorem III.4
We provide our analysis for subexponential samples. The extension to subgaussian samples is accomplished in an identical fashion. Set the estimation error at iteration to be . Note that, when and is a convex regularizer, then the recursion (9) can be iteratively expanded as
[TABLE]
With the advertised probability, subexponential statements of Theorems III.5 and III.6 hold. Hence, for some constants, we have that , and with . Plugging these in (15), we find the following upper bound on the right hand side,
[TABLE]
which is the desired bound. The case of subgaussian samples is again a corollary of Theorems III.5 and III.6. This concludes the proof of our main result.
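The iterative expansion used in this proof is the standard unrolling of a contractive recursion. The display below is a generic sketch with placeholder symbols that are assumptions for illustration and not the paper's notation: $e_\tau \geq 0$ stands for the error at iteration $\tau$, $\rho\in[0,1)$ for the contraction factor, and $\xi \geq 0$ for the per-step noise contribution.

```latex
% Generic placeholders (assumptions): e_\tau is the error at iteration \tau,
% \rho \in [0,1) the contraction factor, \xi the per-step noise contribution.
\begin{align*}
e_{\tau+1} \le \rho\, e_\tau + \xi
\;\Longrightarrow\;
e_\tau &\le \rho^{\tau} e_0 + \xi \sum_{k=0}^{\tau-1} \rho^{k} \\
       &= \rho^{\tau} e_0 + \xi\,\frac{1-\rho^{\tau}}{1-\rho} \\
       &\le \rho^{\tau} e_0 + \frac{\xi}{1-\rho}.
\end{align*}
```

The first term decays geometrically, which is the linear convergence rate, and the second term is the residual statistical error that remains as $\tau$ grows.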
V-B Proof of Theorem III.5 for subgaussian samples
We start our proof with the following lemma.
Lemma V.1
Let be i.i.d. isotropic subgaussian samples. Let be the concatenated data and be the modified-data matrix, where is a vector of all ones. Let be a closed set with Euclidean radius bounded by a constant and
[TABLE]
where for some positive constants and . Assume . Then, with probability at least we have
[TABLE]
The proof of Lemma V.1 is deferred to Section VII-A. Next using the result of Lemma V.1, we obtain the following lemma which bounds the convergence rate for subgaussian samples.
Lemma V.2
Consider the setup of Lemma V.1. Furthermore, let the tangent balls and be as defined in (7) and (8) respectively. Following Lemma V.1, with probability at least , the following holds
[TABLE]
The proof of Lemma V.2 is deferred to Section VII-B. This completes the proof for subgaussian samples.
V-C Proof of Theorem III.5 for subexponential samples
Let be i.i.d. isotropic subexponential vectors and be the associated design matrix as previously. Let and be as defined in (7) and (8) respectively. Assume . Our proof strategy is based on the observation that we can bound the (restricted) singular values of with high probability for subexponential data as follows.
V-C1 Upper bounding the singular values
In this section we will upper bound the largest eigenvalue of the matrix with high probability. Towards this goal, we utilize the Matrix Chernoff bound from [32].
Theorem V.3 (Matrix Chernoff [32])
Consider a finite sequence of independent, random, Hermitian matrices with common dimension . Assume that
[TABLE]
Define the sum and let be an upper bound on the spectral norm of the expectation i.e. . We have that
[TABLE]
We will use Theorem V.3 to bound the largest eigenvalue of . Observe that
[TABLE]
Clearly this matrix is positive semidefinite. To bound , we use the following lemma.
Lemma V.4 (Spectral norm bound)
Let be i.i.d. isotropic subexponential samples in . Then, with probability at least the spectral norm of all matrices can be bounded as
[TABLE]
The proof of Lemma V.4 is deferred to Section VII-C. Lemma V.4 guarantees that . Hence, we satisfy the conditions required by Theorem V.3. Before using Theorem V.3, we will upper bound the spectral norm of the expectation as follows.
Lemma V.5 (Spectral norm bound of expectation)
Let be an isotropic subexponential vector, and let for sufficiently large constant . Then we have
[TABLE]
The proof of Lemma V.5 is deferred to Section VII-D. Thus, applying Lemma V.5 on the set of all satisfying , we find that with probability the following holds
[TABLE]
Hence, we can pick to upper bound the largest eigenvalue of . Now, using Theorem V.3 with and we get
[TABLE]
Union bounding, with probability at least ,
[TABLE]
V-C2 Lower bounding the singular values
In this section we will lower bound the gain of restricted to the tangent ball . We will utilize the notion of restricted singular value (RSV) to proceed.
Definition V.6 (Restricted singular value)
Given a matrix and a closed set , the RSV of at is defined as
[TABLE]
In the following, we will lower bound which is the RSV of at . Recall that any with unit Euclidean norm obeys for and . Consequently
[TABLE]
Setting and minimizing both sides over , we get
[TABLE]
In essence, (18) bounds the RSV of in terms of the RSV of and some simpler terms. The following theorem from [29] (Theorem D.11) gives a lower bound on the RSV of a matrix with i.i.d. subexponential rows.
Theorem V.7 (Bounding RSV [29])
Let be a random matrix with i.i.d. isotropic subexponential rows. Let be a tangent ball as in (7) and suppose the sample size obeys . Then with probability at least , we have that
[TABLE]
Next, we shall state a lemma from [29] (Lemma D.7) to upper bound the term involving the sample average .
Lemma V.8 (Bounding empirical width [29])
Suppose is a subset of the unit Euclidean ball and are i.i.d. zero-mean vectors with bounded subexponential norm. Define the empirical average vector . We have that
[TABLE]
Combining Theorem V.7 and Lemma V.8 into (18) we find that, there exist constants such that with probability at least , we can lower bound the RSV of as,
[TABLE]
where the last line follows from the assumption that .
V-C3 Upper bounding the convergence rate
Union bounding the events (17) and (19), we obtain upper and lower bounds on the singular values of with the desired probability. Hence, we can bound the convergence rate of PGD as follows. Setting , we have (17) . Therefore, choosing the learning rate , the matrix is positive semidefinite (PSD). Hence, applying the generalized Cauchy-Schwarz inequality for PSD matrices, we find
[TABLE]
Here the last inequality follows from (19). This completes the proof for subexponential samples.
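The role of the learning rate in this step can be illustrated numerically. The sketch below assumes the matrix in question has the form $\bm{I}-\eta\,\bm{A}^{T}\bm{A}$ for a design matrix $\bm{A}$ with an appended all-ones column, and it uses the exact largest eigenvalue in place of the high-probability bound from the proof. The sizes and the centered exponential inputs are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20                                   # hypothetical sizes with n > p

# Centered unit-rate exponential entries: zero mean, unit variance, heavier tails.
X = rng.exponential(1.0, size=(n, p)) - 1.0
A = np.hstack([X, np.ones((n, 1))])              # design with all-ones column

gram = A.T @ A
lam_max = np.linalg.eigvalsh(gram)[-1]           # eigvalsh returns ascending eigenvalues
eta = 1.0 / lam_max                              # any eta <= 1 / lam_max works

M = np.eye(p + 1) - eta * gram
eigs = np.linalg.eigvalsh(M)

# Eigenvalues of M are 1 - eta * lambda_i, so they lie in [0, 1 - lam_min / lam_max].
print("smallest eigenvalue of M:", eigs.min())
print("largest eigenvalue of M:", eigs.max())
```

A step size above $1/\lambda_{\max}$ would push the smallest eigenvalue below zero, which is why an upper bound on the largest singular value is needed before the lower bound on the restricted singular value can be turned into a contraction factor.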
V-D Proof of Theorem III.6 for subgaussian samples
Suppose the dataset is -subgaussian. Let and be as defined in Section II, recall from (13) and assume . Representing as for and , we have
[TABLE]
In the following we will upper bound the terms and separately and will combine them to get an upper bound on the residual error.
V-D1 Upper bounding the first term in (20)
In order to upper bound the first term in (20), define the clipping function
[TABLE]
The following lemma immediately follows from union bounding the large deviations of subgaussian and subexponential variables and shows that with high probability.
Lemma V.9
Let be i.i.d. subgaussian random variables with . There exists a constant such that picking , with probability for all , we have
[TABLE]
If instead are i.i.d. subexponential with , then picking leads to the same result.
Using Lemma V.9, with probability . Conditioned on this event, we have
[TABLE]
Setting , (21) can be re-written as
[TABLE]
Note that is subgaussian since is bounded. The subgaussian norm obeys
[TABLE]
Define the average vector which is still subgaussian with same norm (up to a constant). Standard results from functional analysis [28] guarantee
[TABLE]
with probability at least . This bounds the first term of (22). Next, we address the expectation term via the following lemma.
Lemma V.10
Suppose is an isotropic Orlicz-$a$ vector and . Let for sufficiently large constant . For , we have that
[TABLE]
The proof of Lemma V.10 is deferred to Section VII-F. Combining (23) and Lemma V.10 into (22), with probability at least , we find that,
[TABLE]
which is the desired bound for the first term in (20).
V-D2 Upper bounding the second term in (20)
The vector is zero-mean with . Hence, which implies that with probability ,
[TABLE]
Combining the bound above with (24), we get the advertised bound on the residual, namely
[TABLE]
with probability at least . This completes the proof for -subgaussian data.
V-E Proof of Theorem III.6 for subexponential samples
Suppose the dataset is -subexponential. Let and be as defined in Section II, recall from (13) and assume . Similar to the subgaussian case, we split the residual into two terms via (20) and bound each term separately to get a final bound.
V-E1 Upper bounding the first term in (20)
Let . With probability , we have that . We continue the analysis conditioned on this event. With bounded , is subexponential via
[TABLE]
Combining this with Lemma V.8 guarantees that
[TABLE]
with probability at least . Next, using Lemma V.10, we also upper bound by . Combining this with (26) and substituting into the deterministic inequality (22), with probability at least we have,
[TABLE]
V-E2 Upper bounding the second term in (20)
Using and applying Lemma V.8 (over one-dimensional ), we find that with probability .
Combining this with (27) and plugging into (20), we get the advertised upper bound
[TABLE]
which holds with probability at least . This completes the proof for -subexponential data.
VI Conclusion
We studied the problem of finding the best linear model from input-output samples under quadratic loss in the high-dimensional regime . For estimation, we utilized the projected gradient descent algorithm and showed its fast convergence as well as statistical accuracy in a data-dependent fashion. Our results are established for subexponential design, which is heavier-tailed than the well-studied subgaussian design. In both cases, we prove that the nonlinearity of the problem behaves like independent noise, and we establish favorable statistical guarantees as if the problem were linear. We also modified the original regression problem to allow for mean estimation and demonstrated its practical benefit when output labels have nonzero mean via simulations.
It would be desirable to extend our results to general loss functions. If a loss function has the potential to better capture the input/output relation, we can solve for
[TABLE]
Specifically, this function can still be quadratic but characterized by a nonlinear link function, i.e. . We believe that many of the results presented here extend to strongly-increasing , where the derivative is lower bounded by a constant, i.e. for some . These functions are shown to behave like linear regression [33]. However, it is not immediately clear whether the strong statistical and computational guarantees established in this paper (as well as in the related literature) can be established for .
VII Appendix
This section provides the proofs of supporting results.
VII-A Proof of Lemma V.1
We start by expanding the convergence term by substituting as follows,
[TABLE]
where, is the empirical average vector of i.i.d. subgaussian rows . Thus, using (29), we can write
[TABLE]
Given is isotropic subgaussian, Lemma 6.14 in [14] guarantees
[TABLE]
with probability at least . Furthermore, since ’s have bounded subgaussian norm, is also bounded and standard results from functional analysis guarantee [28]
[TABLE]
with probability at least . Combining the results (31) and (32) into (30), we find that
[TABLE]
holds with probability at least . This completes the proof of Lemma V.1.
VII-B Proof of Lemma V.2
Let the tangent balls and be as defined in (7) and (8) respectively. Define the sets
[TABLE]
and note that,
[TABLE]
Similarly, . Applying Lemma V.1 on and , with advertised probability, we have
[TABLE]
where . Now, for any , picking , we have
[TABLE]
To proceed, note that
[TABLE]
Hence, holds with the advertised probability.
VII-C Proof of Lemma V.4
Let be i.i.d. isotropic subgaussian samples and let be the concatenated design matrix. Let denote the element of the matrix . Since each has subexponential norm bounded by a constant, there exists a constant such that holds with probability at least using the subexponential tail bound. Union bounding over all entries of yields that holds for all with probability at least . Hence, we can bound each row of with probability at least via
[TABLE]
or equivalently, we have
[TABLE]
This completes the proof of Lemma V.4.
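The union-bound argument in this proof can be checked numerically. The sketch below is an illustration under stated assumptions, not the lemma's exact statement: it draws unit-variance Laplace entries as a concrete subexponential design, and the constant `c` and the helper names are hypothetical. For Laplace entries with scale `b`, the tail is exactly P(|x| > t) = exp(-t/b), so the threshold t = (c+1) b log(np) fails for some entry with probability at most (np)^(-c).

```python
import numpy as np

LAPLACE_SCALE = 1.0 / np.sqrt(2.0)  # gives unit variance, since Var = 2 * scale^2

def sample_laplace_design(n, p, rng):
    """Draw an n x p matrix with i.i.d. zero-mean, unit-variance Laplace entries."""
    return rng.laplace(loc=0.0, scale=LAPLACE_SCALE, size=(n, p))

def entry_bound(n, p, c=1.0):
    """Threshold t with n*p*exp(-t/scale) = (n*p)^(-c), from a union bound over entries."""
    return (c + 1.0) * LAPLACE_SCALE * np.log(n * p)

rng = np.random.default_rng(0)
n, p = 200, 1000
X = sample_laplace_design(n, p, rng)
t = entry_bound(n, p)

max_entry = np.abs(X).max()
max_row_norm = np.linalg.norm(X, axis=1).max()

# Every entry is below t with high probability; each row norm is then at most sqrt(p) * t.
print(f"max |x_ij| = {max_entry:.2f}, threshold = {t:.2f}")
print(f"max row norm = {max_row_norm:.2f}, sqrt(p) * threshold = {np.sqrt(p) * t:.2f}")
```

The row-norm comparison uses only the elementary inequality between the Euclidean norm and the largest entry, so it is typically loose. It is there to show how an entrywise bound turns into a bound on each row.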
VII-D Proof of Lemma V.5
Recall that are i.i.d. isotropic subexponential vectors and . We can estimate the covariance matrix of given using law of total probability as follows
[TABLE]
Since a covariance matrix is positive-semidefinite, each term in (35) is individually positive semidefinite. Hence, we will drop the second term in (35) to get the following lower bound on the covariance matrix
[TABLE]
Using Lemma V.4, it follows that holds with probability at least . Hence, following (36), we get
[TABLE]
This completes the proof of Lemma V.5.
VII-E Proof of Lemma V.9
Subgaussian case: Using the subgaussian tail bound, for a large enough constant , for each , we have with probability at least . This implies . Union bounding over all entries of , we find the result which holds with probability at least .
Subexponential case: follows similarly with .
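The two cases differ only in how fast the threshold must grow for the union bound to close. The generic calculation below is a worked illustration under assumed notation: $z_1,\dots,z_N$ stand for the $N$ entries being bounded, $K$ is a bound on their subgaussian or subexponential norm, and $c>0$ is a free constant. None of these symbols or constants are taken from the lemma statement.

```latex
% Subgaussian entries: P(|z_k| > t) <= 2 exp(-t^2 / K^2).
\mathbb{P}\Big(\max_{1\le k\le N}|z_k| > t\Big)
  \le \sum_{k=1}^{N}\mathbb{P}(|z_k| > t)
  \le 2N\exp\!\big(-t^{2}/K^{2}\big)
  = 2N^{-c}
  \quad\text{for } t = K\sqrt{(c+1)\log N}.

% Subexponential entries: P(|z_k| > t) <= 2 exp(-t / K).
\mathbb{P}\Big(\max_{1\le k\le N}|z_k| > t\Big)
  \le 2N\exp\!\big(-t/K\big)
  = 2N^{-c}
  \quad\text{for } t = (c+1)\,K\log N.
```

The threshold therefore scales as $\sqrt{\log N}$ for subgaussian entries and as $\log N$ for subexponential ones, which is the only change between the two cases.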
VII-F Proof of Lemma V.10
We prove the result for subexponential samples. The subgaussian case follows similarly. Without loss of generality, let as everything can be scaled accordingly. Defining the clip function as previously, set . Furthermore, let denote the tail of , such that,
[TABLE]
is an upper bound on the error due to clipping, that is,
[TABLE]
We proceed by upper bounding in terms of , using the subadditive property of the -norm and the orthogonality of and (i.e., ) as follows
[TABLE]
Using subexponentiality, for some constant , we have that, and , where, the latter follows from union bounding over all entries of . Union bounding these two events, we get the following tail bound for their product,
[TABLE]
For notational convenience, set
[TABLE]
and note that satisfies the following property due to (37)
[TABLE]
Furthermore, from (40) we get the following tail distribution
[TABLE]
for . Combining (41), (42) and (43) into (39) and denoting probability density function of by , we get
[TABLE]
where, (a) follows from (43). To bound the term on the right hand side, we do a change of variable in (44) by setting to get,
[TABLE]
Combining this with (44), we get
[TABLE]
where, we get (a) by picking with sufficiently large . Finally, note that conditioned on , and
[TABLE]
Since , this yields $\|\operatorname{\mathbb{E}}[w\bm{x} \mid |w| \leq B]\|_{\ell_2} \lesssim p^{2} n^{-201}$, which is the advertised result with .
Similarly for subgaussian samples, one can show that
[TABLE]
Picking with sufficiently large , we get the same result, concluding the proof of Lemma V.10.
- [1] Y. Plan, R. Vershynin, and E. Yudovina, “High-dimensional estimation with geometric constraints,” Information and Inference: A Journal of the IMA, vol. 6, no. 1, pp. 1–40, 2016.
- [2] P. T. Boufounos and R. G. Baraniuk, “1-bit compressive sensing,” in Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on. IEEE, 2008, pp. 16–21.
- [3] R. Ganti, N. Rao, R. M. Willett, and R. Nowak, “Learning single index models in high dimensions,” arXiv preprint arXiv:1506.08910, 2015.
- [4] S. Oymak and M. Soltanolkotabi, “Fast and reliable parameter estimation from nonlinear observations,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2276–2300, 2017.
- [5] Y. Plan, R. Vershynin, and E. Yudovina, “High-dimensional estimation with geometric constraints,” Information and Inference: A Journal of the IMA, vol. 6, no. 1, pp. 1–40, 2017.
- [6] Y. Plan and R. Vershynin, “The generalized lasso with non-linear observations,” IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1528–1537, 2016.
- [7] C. Thrampoulidis, E. Abbasi, and B. Hassibi, “Lasso with non-linear measurements is equivalent to one with linear measurements,” in Advances in Neural Information Processing Systems, 2015, pp. 3420–3428.
- [8] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk, “Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors,” IEEE Transactions on Information Theory, vol. 59, no. 4, pp. 2082–2102, 2013.
