Quickly Finding the Best Linear Model in High Dimensions

Yahya Sattar; Samet Oymak

arXiv:1907.01728·cs.LG·February 9, 2021

TL;DR

This paper introduces a projected gradient descent algorithm for efficiently finding the optimal linear model in high-dimensional settings, with theoretical guarantees and practical validation.

Contribution

It presents a novel PGD method with convergence and error bounds applicable to heavy-tailed distributions, without assuming realizability, and includes bias learning augmentation.

Findings

01 Linear convergence rate established for PGD.

02 Effective in heavy-tailed sub-exponential distributions.

03 Numerical experiments confirm theoretical predictions.

Abstract

We study the problem of finding the best linear model that can minimize least-squares loss given a dataset. While this problem is trivial in the low-dimensional regime, it becomes more interesting in high dimensions where the population minimizer is assumed to lie on a manifold such as sparse vectors. We propose a projected gradient descent (PGD) algorithm to estimate the population minimizer in the finite sample regime. We establish a linear convergence rate and data-dependent estimation error bounds for PGD. Our contributions include: 1) The results are established for heavier-tailed sub-exponential distributions besides sub-gaussian. 2) We directly analyze the empirical risk minimization and do not require a realizable model that connects input data and labels. 3) Our PGD algorithm is augmented to learn the bias term, which boosts the performance. The numerical experiments validate our…

Equations

$${\cal{L}}(\bm{\theta})=\frac{1}{2}\sum_{i=1}^{n}(y_{i}-\left<\bm{\theta},\bm{x}_{i}\right>)^{2}.$$

$$\bm{\theta}^{\star}=\arg\min_{\bm{\theta}}\operatorname{\mathbb{E}}[{\cal{L}}(\bm{\theta})]=\operatorname{\mathbb{E}}[y\bm{x}].$$

$$\hat{\bm{\theta}}=\arg\min_{\bm{\theta}}\frac{1}{2}\|\bm{y}-{\bm{X}}\bm{\theta}\|_{\ell_{2}}^{2}\quad\text{subject to}\quad\mathcal{R}(\bm{\theta})\leq R.$$

$$\hat{\bm{\theta}},\hat{\mu}=\arg\min_{\bm{\theta},\mu}{\cal{L}}(\bm{\theta},\mu)\quad\text{subject to}\quad\mathcal{R}(\bm{\theta})\leq R.$$

$$\bm{\theta}_{\tau+1}={\cal{P}}_{\mathcal{K}}(\bm{\theta}_{\tau}-\eta\nabla{\cal{L}}_{\bm{\theta}}(\bm{\theta}_{\tau},\mu_{\tau})),\qquad\mu_{\tau+1}=\mu_{\tau}-\eta\nabla{\cal{L}}_{\mu}(\bm{\theta}_{\tau},\mu_{\tau}),$$

$$\bm{\theta}^{\star},\mu^{\star}=\operatorname{\mathbb{E}}[y\bm{x}],\operatorname{\mathbb{E}}[y].$$

$$\mathcal{K},\qquad\mathcal{K}_{\text{ext}}$$

$$\begin{bmatrix}\bm{\theta}_{\tau+1}\\ \mu_{\tau+1}\end{bmatrix}={\cal{P}}_{\mathcal{K}_{\text{ext}}}\bigg(\begin{bmatrix}\bm{\theta}_{\tau}\\ \mu_{\tau}\end{bmatrix}+\eta[{\bm{X}}~\mathbf{1}]^{T}\bigg(\bm{y}-[{\bm{X}}~\mathbf{1}]\begin{bmatrix}\bm{\theta}_{\tau}\\ \mu_{\tau}\end{bmatrix}\bigg)\bigg),$$

$$[{\bm{X}}~\mathbf{1}]=\begin{bmatrix}\bm{x}_{1}^{T}&1\\ \vdots&\vdots\\ \bm{x}_{n}^{T}&1\end{bmatrix}.$$

$$\mathcal{C}=\text{cl}(\{\alpha\bm{v}~\big|~\bm{v}+\bm{\theta}^{\star}\in\mathcal{K},~\alpha\geq 0\})\bigcap\mathcal{B}^{p}.$$

$$\mathcal{C}_{\text{ext}}=\bigg\{\begin{bmatrix}\alpha\bm{v}\\ \gamma\end{bmatrix}~\big|~\alpha\geq 0,~\bm{v}\in\mathcal{C},~\gamma\in\mathbb{R}\bigg\}\bigcap\mathcal{B}^{p+1}.$$

$$\|\bm{h}_{\tau+1}\|_{\ell_{2}}\leq\kappa(\|\bm{h}_{\tau}\|_{\ell_{2}}\rho(\mathcal{C})+\eta\nu(\mathcal{C}))$$

$$\rho(\mathcal{C})=\sup_{\bm{u},\bm{v}\in\mathcal{C}_{\text{ext}}}|\bm{u}^{T}({\bm{I}}-\eta[{\bm{X}}~\mathbf{1}]^{T}[{\bm{X}}~\mathbf{1}])\bm{v}|,$$

$$\nu(\mathcal{C})=\sup_{\bm{v}\in\mathcal{C}_{\text{ext}}}|\bm{v}^{T}[{\bm{X}}~\mathbf{1}]^{T}\bm{w}|.$$

$$\omega(S)=\operatorname{\mathbb{E}}_{\bm{g}\sim\mathcal{N}(0,{\bm{I}}_{p})}[\sup_{\bm{v}\in S}\bm{v}^{T}\bm{g}].$$

$$\omega_{n}(T)=\min_{\text{rad}(S)\leq C,~\text{clconv}(S)\supseteq T}\omega(S)+\frac{\gamma_{1}(S)}{\sqrt{n}}$$

$$\omega^{2}(\mathcal{C})\sim\omega_{n}^{2}(\mathcal{C})$$

$$\|X\|_{\psi_{a}}=\sup_{p\geq 1}p^{-1/a}(\operatorname{\mathbb{E}}[|X|^{p}])^{1/p}$$

$$\|y-\bm{x}^{T}\bm{\theta}^{\star}-\mu^{\star}\|_{\psi_{a}}\leq\sigma.$$

$$\bigg\|\begin{bmatrix}\bm{\theta}_{\tau}-\bm{\theta}^{\star}\\ \mu_{\tau}-\mu^{\star}\end{bmatrix}\bigg\|_{\ell_{2}}\leq\cdots+C\sigma\frac{(\omega(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

$$\bigg\|\begin{bmatrix}\bm{\theta}_{\tau}-\bm{\theta}^{\star}\\ \mu_{\tau}-\mu^{\star}\end{bmatrix}\bigg\|_{\ell_{2}}\leq\cdots+C\sigma\frac{(\omega_{n}(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

$$\rho(\mathcal{C})\lesssim\frac{\omega(\mathcal{C})+t}{\sqrt{n}}.$$

$$\rho(\mathcal{C})\leq 1-C_{0}\eta n.$$

$$\bm{w}=\bm{y}-{\bm{X}}\bm{\theta}^{\star}-\mathbf{1}\mu^{\star}.$$

$$\bm{e}=[{\bm{X}}~\mathbf{1}]^{T}\bm{w}=\sum_{i=1}^{n}(y_{i}-\mu^{\star}-\bm{x}_{i}^{T}\bm{\theta}^{\star})\begin{bmatrix}\bm{x}_{i}\\ 1\end{bmatrix}.$$

$$\frac{\nu(\mathcal{C})}{n}\lesssim\frac{\sigma(\omega(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

$$\frac{\nu(\mathcal{C})}{n}\lesssim\frac{\sigma(\omega_{n}(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

$$\|\bm{h}_{\tau}\|_{\ell_{2}}$$

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Statistical Methods and Inference · Stochastic Gradient Optimization Techniques

Full text

Quickly Finding the Best Linear Model in High Dimensions

Yahya Sattar, Samet Oymak. Department of Electrical and Computer Engineering, University of California, Riverside, CA 92521, USA. Email: [email protected], [email protected].

Abstract

We study the problem of finding the best linear model that can minimize least-squares loss given a dataset. While this problem is trivial in the low-dimensional regime, it becomes more interesting in high dimensions where the population minimizer is assumed to lie on a manifold such as sparse vectors. We propose a projected gradient descent (PGD) algorithm to estimate the population minimizer in the finite sample regime. We establish a linear convergence rate and data-dependent estimation error bounds for PGD. Our contributions include: 1) The results are established for heavier-tailed sub-exponential distributions besides sub-gaussian. 2) We directly analyze the empirical risk minimization and do not require a realizable model that connects input data and labels. 3) Our PGD algorithm is augmented to learn the bias term, which boosts the performance. The numerical experiments validate our theoretical results.

Index Terms:

high-dimensional estimation, projected gradient descent, one-bit compressed sensing, Gaussian width.

I Introduction

Supervised learning is concerned with finding a relation between the input-output pairs $(\bm{x}_{i},y_{i})_{i=1}^{n}\in\mathbb{R}^{p}\times\mathbb{R}$. The simplest relations are linear functions where the output $y_{i}$ is estimated by a linear function of the input, that is, $\hat{y}_{i}=\left<\bm{x}_{i},\bm{\theta}\right>$. Using quadratic loss, we can find the optimal $\bm{\theta}$ with a simple linear regression which minimizes

$${\cal{L}}(\bm{\theta})=\frac{1}{2}\sum_{i=1}^{n}(y_{i}-\left<\bm{\theta},\bm{x}_{i}\right>)^{2}.$$

If the samples are i.i.d. and input has identity covariance, the population minimizer ( $n\rightarrow\infty$ ) is simply given by

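$$\bm{\theta}^{\star}=\arg\min_{\bm{\theta}}\operatorname{\mathbb{E}}[{\cal{L}}(\bm{\theta})]=\operatorname{\mathbb{E}}[y\bm{x}].$$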

where $(\bm{x},y)$ is drawn from the same distribution as the data. In many applications, we operate in the high-dimensional regime where we have fewer samples than the parameter dimension, i.e. $n\ll p$. In this case, the problem is ill-posed; however, if $\bm{\theta}^{\star}$ lies on a low-dimensional manifold, we can take advantage of this information to solve the problem. We assume $\bm{\theta}^{\star}$ is structured-sparse; for instance, it can be a signal that is sparse in a dictionary or it can be a low-rank matrix. If $\mathcal{R}$ is a regularization function that promotes this structure, we can solve the regularized empirical risk minimization (ERM)

$$\hat{\bm{\theta}}=\arg\min_{\bm{\theta}}\frac{1}{2}\|\bm{y}-{\bm{X}}\bm{\theta}\|_{\ell_{2}}^{2}\quad\text{subject to}\quad\mathcal{R}(\bm{\theta})\leq R.\tag{1}$$

where $\bm{y}=[y_{1}~\dots~y_{n}]^{T}\in\mathbb{R}^{n}$ and ${\bm{X}}=[\bm{x}_{1}~\dots~\bm{x}_{n}]^{T}\in\mathbb{R}^{n\times p}$ are the output labels and data matrix respectively. This problem is well-studied in the statistics and compressed sensing (CS) literature. However, much of the theory literature is concerned with the scenario where the problem is realizable, i.e. the outputs are explicitly generated with respect to some ground truth vector $\bm{a}$. In the simplest scenario, the input/output relation can be $y=\left<\bm{x},\bm{a}\right>+{\bm{z}}$ where ${\bm{z}}$ is independent zero-mean noise. In this case, one simply has $\bm{\theta}^{\star}=\bm{a}$. Such a realizability assumption is also common in single-index models [1, 2]. One contribution of this paper is analyzing regularized ERM without the realizability assumption.
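As a short worked step (a sketch, assuming the expectations exist and the input has identity covariance as stated above), the identity $\bm{\theta}^{\star}=\operatorname{\mathbb{E}}[y\bm{x}]$ follows from the first-order condition of the per-sample population loss, and the realizable case $\bm{\theta}^{\star}=\bm{a}$ follows by substituting $y=\left<\bm{x},\bm{a}\right>+{\bm{z}}$:

$$\nabla_{\bm{\theta}}\tfrac{1}{2}\operatorname{\mathbb{E}}[(y-\left<\bm{x},\bm{\theta}\right>)^{2}]=\operatorname{\mathbb{E}}[\bm{x}\bm{x}^{T}]\bm{\theta}-\operatorname{\mathbb{E}}[y\bm{x}]=\bm{\theta}-\operatorname{\mathbb{E}}[y\bm{x}]=0\quad\Longrightarrow\quad\bm{\theta}^{\star}=\operatorname{\mathbb{E}}[y\bm{x}],$$

$$\operatorname{\mathbb{E}}[y\bm{x}]=\operatorname{\mathbb{E}}[\bm{x}\bm{x}^{T}]\bm{a}+\operatorname{\mathbb{E}}[{\bm{z}}\bm{x}]=\bm{a}+\operatorname{\mathbb{E}}[{\bm{z}}]\operatorname{\mathbb{E}}[\bm{x}]=\bm{a}.$$

The last equality uses only that the noise is independent of the input and zero-mean.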

Bias in the data can negatively affect the estimation quality. Assuming input is zero-mean, instead of solving (1) we can solve a modified problem which accounts for the mean of the output as well. Again, denoting the regularization function by $\mathcal{R}$ , we will solve the modified problem

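$$\hat{\bm{\theta}},\hat{\mu}=\arg\min_{\bm{\theta},\mu}{\cal{L}}(\bm{\theta},\mu)\quad\text{subject to}\quad\mathcal{R}(\bm{\theta})\leq R,\tag{2}$$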

where the loss is given by ${\cal{L}}(\bm{\theta},\mu)=\frac{1}{2}\big\|\bm{y}-[{\bm{X}}~\mathbf{1}]\begin{bmatrix}\bm{\theta}\\ \mu\end{bmatrix}\big\|_{\ell_{2}}^{2}$. We will show that solving problem (2) is essentially equivalent to solving (1) with debiased output; hence it results in more accurate estimation. The goal of this paper is to study problem (2) under a general algorithmic framework, establish finite-sample statistical and algorithmic convergence, and address practical considerations on the data distribution. In particular, we are interested in how well one can estimate the best linear model (BLM) given by the pair $(\bm{\theta}^{\star}=\operatorname{\mathbb{E}}[y\bm{x}],\mu^{\star}=\operatorname{\mathbb{E}}[y])$. For estimation, we will utilize the projected gradient descent algorithm given by the iterates

$$\bm{\theta}_{\tau+1}={\cal{P}}_{\mathcal{K}}(\bm{\theta}_{\tau}-\eta\nabla{\cal{L}}_{\bm{\theta}}(\bm{\theta}_{\tau},\mu_{\tau})),\qquad\mu_{\tau+1}=\mu_{\tau}-\eta\nabla{\cal{L}}_{\mu}(\bm{\theta}_{\tau},\mu_{\tau}),\tag{3}$$

where ${\cal{P}}_{\mathcal{K}}$ projects onto the constraint set $\mathcal{K}=\{\bm{\theta}\in\mathbb{R}^{p}{~{}\big{|}~{}}\mathcal{R}(\bm{\theta})\leq R\}$ and $\eta$ is the step size.
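For concreteness, the two partial gradients in the iterates above can be written out directly from the loss ${\cal{L}}(\bm{\theta},\mu)=\frac{1}{2}\|\bm{y}-{\bm{X}}\bm{\theta}-\mu\mathbf{1}\|_{\ell_{2}}^{2}$. The residual symbol $\bm{r}_{\tau}$ is introduced here only for illustration:

$$\nabla{\cal{L}}_{\bm{\theta}}(\bm{\theta}_{\tau},\mu_{\tau})=-{\bm{X}}^{T}\bm{r}_{\tau},\qquad\nabla{\cal{L}}_{\mu}(\bm{\theta}_{\tau},\mu_{\tau})=-\mathbf{1}^{T}\bm{r}_{\tau},\qquad\bm{r}_{\tau}=\bm{y}-{\bm{X}}\bm{\theta}_{\tau}-\mu_{\tau}\mathbf{1},$$

so that one step reads $\bm{\theta}_{\tau+1}={\cal{P}}_{\mathcal{K}}(\bm{\theta}_{\tau}+\eta{\bm{X}}^{T}\bm{r}_{\tau})$ and $\mu_{\tau+1}=\mu_{\tau}+\eta\mathbf{1}^{T}\bm{r}_{\tau}$. Only $\bm{\theta}$ is projected; the bias $\mu$ takes a plain gradient step.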

I-A Relation to Prior Work

There is a significant amount of literature on nonlinear (or one-bit) CS [3, 4, 5, 6, 7, 2, 8, 9, 10, 11, 12]. The works [13, 4, 14, 15, 16] study algorithmic and statistical convergence rates for first-order methods such as projected/proximal gradient descent. For nonlinear CS, [17, 4, 7, 5] provide statistical analysis of single-index estimation with a focus on Gaussian data. Recently, one-bit CS techniques have been extended to sub-gaussian distributions using the dithering trick, which adds noise before quantization [18, 19, 20, 21]. Dithering is introduced to guarantee consistent estimation of the ground-truth parameter. The papers [22, 23, 24, 25, 26] address non-gaussianity by utilizing the Stein identity, which requires access to the distribution of the input samples. Closer to our work, [27] studies constrained empirical risk minimization with linear functions and squared loss with a focus on convex problems. In comparison, our analysis applies to a broader class of distributions and focuses on first-order algorithms. Much of our analysis focuses on addressing subexponential samples, which requires tools from high-dimensional probability [28, 29].

Our results apply to general regularizers and borrow ideas from [4, 5, 6, 7]. Similar to these, we view the nonlinearity between input and output as an additive noise. The convergence analysis of projected gradient descent is a rather well-understood topic and we utilize insights from [15, 14, 13, 16] for our analysis.

I-B Contributions

At a high level, our work has three distinguishing features compared to the prior literature.

$\bullet$ Subexponential samples: Most nonlinear CS results apply to Gaussian data, or to subgaussian data when the dithering trick is utilized [18, 19, 20, 21]. We take advantage of recent techniques for subexponential distributions to provide statistical/computational guarantees for heavier-tailed distributions.

$\bullet$ No realizability assumption: The nonlinear CS literature is typically concerned with a ground-truth vector to be recovered. For instance, one-bit CS aims to learn $\bm{\theta}$ from samples of type $y=\textrm{sgn}(\bm{\theta}^{T}\bm{x})$. Unlike these, we do not require such a relationship to exist between input and output; hence the results apply under much weaker assumptions. Instead of a ground-truth $\bm{\theta}$, we work with the population BLM $\bm{\theta}^{\star}$. However, $\bm{\theta}^{\star}$ can be shown to coincide with the ground truth when it exists, if the input distribution is nice (e.g. Gaussian) [4, 5, 6, 7].

$\bullet$ Bias estimation: Our analysis addresses the bias in the output by solving the modified problem (2). We show that (2) can be studied in a similar fashion to (1) by studying the statistical properties of the concatenated data matrix. Empirically, however, this modification results in a substantial improvement in estimation.

I-C Paper Organization

We review mathematical background and formulate the problem in Section II. We introduce our main results on statistical and computational convergence guarantees in Section III. Section IV provides numerical experiments to corroborate our theoretical results. Proofs of the main results are provided in Section V and finally the concluding remarks are made in Section VI.

II Preliminaries and Problem Formulation

In this section we introduce statistical quantities which are utilized to characterize the benefits of the regularization $\mathcal{R}$ .

We first set the notation. $c,c_{0},\dots,C$ denote positive absolute constants. For a vector $\bm{v}$, we denote its Euclidean norm by $\|{\bm{v}}\|_{\ell_{2}}$ and its $\ell_{\infty}$ norm by $\|\bm{v}\|_{\infty}$. Similarly, for a matrix ${\bm{X}}$, we denote its spectral norm by $\|{\bm{X}}\|$. Given a set $S$, let $\text{cl}(S)$ and $\text{clconv}(S)$ be the minimal closed set and minimal closed convex set containing $S$ respectively. Let $\text{rad}(S)$ denote the set radius $\sup_{\bm{v}\in S}\|{\bm{v}}\|_{\ell_{2}}$. For closed sets, let ${\cal{P}}_{S}(\cdot)$ be the projection operator defined as ${\cal{P}}_{S}(\bm{a})=\arg\min_{\bm{v}\in S}\|{\bm{a}-\bm{v}}\|_{\ell_{2}}$. $\mathcal{N}(\mu,\sigma^{2})$ denotes the normal distribution and $\mathcal{B}^{p}$ denotes the unit ball in $\mathbb{R}^{p}$. ${\mathbf{1}}$ is the all-ones vector of proper dimension. We will use $\gtrsim$ and $\lesssim$ for inequalities that hold up to a constant factor.

Suppose we are given $n$ i.i.d. samples $(\bm{x}_{i},y_{i})_{i=1}^{n}\sim(\bm{x},y)$ . To keep the exposition clean, we assume that $\bm{x}$ is whitened, that is, it has zero-mean and identity covariance. We will aim to find a linear relation between the modified input-output pairs $([\bm{x}_{i}^{T}~{}1]^{T},y_{i})_{i=1}^{n}$ . Let us consider the statistical properties of our modified estimate in the population limit which is given by

$$(\bm{\theta}^{\star},\mu^{\star})=(\operatorname{\mathbb{E}}[y\bm{x}],\operatorname{\mathbb{E}}[y]).$$

Thus, in the limiting case, $\mu^{\star}$ captures the mean of the output and $\bm{\theta}^{\star}$ is the ideal solution of the problem with debiased output. Our goal is to estimate the population minimizer ${\bm{\theta}^{\star}},\mu^{\star}$, which minimizes the expected quadratic loss $\operatorname{\mathbb{E}}[(y-\bm{\theta}^{T}\bm{x}-\mu)^{2}]$. As discussed in Section I, assuming $\bm{\theta}^{\star}$ is structured-sparse, we consider a non-asymptotic estimation of ${\bm{\theta}^{\star}},\mu^{\star}$ via problem (2). To proceed with the analysis, set

[TABLE]

We investigate the PGD algorithm (3) which can be written as

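$$\begin{bmatrix}\bm{\theta}_{\tau+1}\\ \mu_{\tau+1}\end{bmatrix}={\cal{P}}_{\mathcal{K}_{\text{ext}}}\bigg(\begin{bmatrix}\bm{\theta}_{\tau}\\ \mu_{\tau}\end{bmatrix}+\eta[{\bm{X}}~\mathbf{1}]^{T}\bigg(\bm{y}-[{\bm{X}}~\mathbf{1}]\begin{bmatrix}\bm{\theta}_{\tau}\\ \mu_{\tau}\end{bmatrix}\bigg)\bigg),\tag{6}$$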

where $\eta$ is a fixed learning rate and $[{\bm{X}}~{}\mathbf{1}]\in\mathbb{R}^{n\times(p+1)}$ is the modified data matrix constructed as follows

$$[{\bm{X}}~\mathbf{1}]=\begin{bmatrix}\bm{x}_{1}^{T}&1\\ \vdots&\vdots\\ \bm{x}_{n}^{T}&1\end{bmatrix}.$$
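The stacked update can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names `hard_threshold` and `pgd_with_bias` are hypothetical, the projection is assumed to be hard thresholding onto $s$-sparse vectors (the choice used later in the experiments), and the bias coordinate is left unprojected, as in the iterates (3).

```python
import numpy as np

def hard_threshold(theta, s):
    """Project onto s-sparse vectors: keep the s largest-magnitude entries."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s:]
    out[keep] = theta[keep]
    return out

def pgd_with_bias(X, y, s, eta, iters=200):
    """Iterate z <- P(z + eta * [X 1]^T (y - [X 1] z)) with z = [theta; mu]."""
    n, p = X.shape
    X1 = np.hstack([X, np.ones((n, 1))])   # modified data matrix [X 1]
    z = np.zeros(p + 1)                    # stacked iterate [theta; mu]
    for _ in range(iters):
        z = z + eta * X1.T @ (y - X1 @ z)  # gradient step on the least-squares loss
        z[:p] = hard_threshold(z[:p], s)   # project theta only; mu is unconstrained
    return z[:p], z[p]

# Toy usage on noiseless sparse linear data with a nonzero offset (assumed setup).
rng = np.random.default_rng(0)
n, p, s = 1000, 50, 5
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:s] = 1.0
y = X @ theta_true + 3.0
theta_hat, mu_hat = pgd_with_bias(X, y, s=s, eta=1.0 / n)
print(np.linalg.norm(theta_hat - theta_true), mu_hat)
```

The step size $\eta=1/n$ mirrors the subgaussian setting of the main results; a different projection can be swapped in by replacing `hard_threshold`.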

Following [4, 30] PGD analysis can be related to the tangent ball around the population parameter $\bm{\theta}^{\star}$ which is given by

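$$\mathcal{C}=\text{cl}(\{\alpha\bm{v}~\big|~\bm{v}+\bm{\theta}^{\star}\in\mathcal{K},~\alpha\geq 0\})\bigcap\mathcal{B}^{p}.\tag{7}$$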

Similarly, we define the extended tangent ball as follows

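$$\mathcal{C}_{\text{ext}}=\bigg\{\begin{bmatrix}\alpha\bm{v}\\ \gamma\end{bmatrix}~\big|~\alpha\geq 0,~\bm{v}\in\mathcal{C},~\gamma\in\mathbb{R}\bigg\}\bigcap\mathcal{B}^{p+1}.\tag{8}$$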

The two definitions above are closely related. For any vector $\bm{v}\in\mathcal{C}$, we have that $[\sqrt{1-\gamma^{2}}\bm{v}^{T}~\gamma]^{T}\in\mathcal{C}_{\text{ext}}$ for $|\gamma|\leq 1$. In the following, we will express the convergence rates and residual errors of the PGD algorithm (3) in terms of the statistical properties of the tangent balls.
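The membership claim can be checked in one line. Taking $\alpha=\sqrt{1-\gamma^{2}}\geq 0$ in the definition of the extended tangent ball and using $\|\bm{v}\|_{\ell_{2}}\leq 1$ (since $\mathcal{C}\subset\mathcal{B}^{p}$),

$$\bigg\|\begin{bmatrix}\sqrt{1-\gamma^{2}}\,\bm{v}\\ \gamma\end{bmatrix}\bigg\|_{\ell_{2}}^{2}=(1-\gamma^{2})\|\bm{v}\|_{\ell_{2}}^{2}+\gamma^{2}\leq(1-\gamma^{2})+\gamma^{2}=1,$$

so the stacked vector lies in $\mathcal{B}^{p+1}$ and therefore in the extended tangent ball.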

Technical approach: Denoting the parameter estimation error in (6) by $\bm{h}_{\tau}=[{\bm{\theta}_{\tau}}^{T}~{}\mu_{\tau}]^{T}-[{\bm{\theta}^{\star}}^{T}~{}\mu^{\star}]^{T}$ and the effective noise by $\bm{w}=\bm{y}-[{\bm{X}}~{}\mathbf{1}][{\bm{\theta}^{\star}}^{T}~{}\mu^{\star}]^{T}$ , the PGD update can be shown to obey [14] (see Eq. (VI.10))

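$$\|\bm{h}_{\tau+1}\|_{\ell_{2}}\leq\kappa(\|\bm{h}_{\tau}\|_{\ell_{2}}\rho(\mathcal{C})+\eta\nu(\mathcal{C}))\tag{9}$$

where $\kappa$ is a numerical constant which is equal to $1$ for a convex regularizer $\mathcal{R}$ and $2$ for an arbitrary $\mathcal{R}$, and

$$\rho(\mathcal{C})=\sup_{\bm{u},\bm{v}\in\mathcal{C}_{\text{ext}}}|\bm{u}^{T}({\bm{I}}-\eta[{\bm{X}}~\mathbf{1}]^{T}[{\bm{X}}~\mathbf{1}])\bm{v}|,\tag{10}$$

$$\nu(\mathcal{C})=\sup_{\bm{v}\in\mathcal{C}_{\text{ext}}}|\bm{v}^{T}[{\bm{X}}~\mathbf{1}]^{T}\bm{w}|.\tag{11}$$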

Here $\rho$ captures the algorithmic convergence and $\nu$ captures the statistical accuracy in terms of regularization. To achieve statistical learning bounds, we need to characterize the quantities above in the finite-sample regime. Existing literature provides a fairly good understanding of the related terms when ${\bm{X}}$ has subgaussian rows or $\bm{w}$ is independent of ${\bm{X}}$. The technical contributions of this work are i) extending these results to subexponential samples, ii) allowing for nonlinear dependencies between the noise and data, and iii) addressing the bias term by studying the concatenated matrix $[{\bm{X}}~\mathbf{1}]$. To proceed with the statistical analysis, we introduce the Gaussian width.

Definition II.1 ((Perturbed) Gaussian width [29])

The Gaussian width of a set $S\subset\mathcal{B}^{p}$ is defined as

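$$\omega(S)=\operatorname{\mathbb{E}}_{\bm{g}\sim\mathcal{N}(0,{\bm{I}}_{p})}[\sup_{\bm{v}\in S}\bm{v}^{T}\bm{g}].$$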

Let $C>0$ be an absolute constant. Given an integer $n\geq 1$ , the perturbed Gaussian width $\omega_{n}(T)$ of $T\subset\mathcal{B}^{d}$ is defined as

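$$\omega_{n}(T)=\min_{\text{rad}(S)\leq C,~\text{clconv}(S)\supseteq T}\omega(S)+\frac{\gamma_{1}(S)}{\sqrt{n}}$$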

where $\gamma_{1}(S)$ is Talagrand’s $\gamma_{1}$ -functional (see [28]) with $\ell_{2}$ -metric.

The Gaussian width helps to quantify the complexity of the regularized problem and determines the sample complexity of linear inverse problems, i.e. high-dimensional problems become manageable in the regime $n\gtrsim\omega^{2}(\mathcal{C})$ [31, 30]. The perturbed width was introduced more recently in [29] to address subexponential samples. [29] shows that, for standard regularizers such as $\ell_{0},\ell_{1}$, subspace, and rank constraints, we have that

$$\omega^{2}(\mathcal{C})\sim\omega_{n}^{2}(\mathcal{C})\tag{12}$$

in the interesting regime $n\geq\omega^{2}(\mathcal{C})$. Hence, the perturbed width yields the same statistical accuracy as the Gaussian width but applies to subexponential samples.

As illustrated in Table I, the square of the Gaussian width captures the degrees of freedom for practical regularizers. Table I is obtained by setting $R=\mathcal{R}(\bm{\theta}^{\star})$ in (4). In practice, a good choice for $R$ can be found by cross-validation. It is also known that the performance of PGD is robust to the choice of $R$ (see Thm 2.6 of [14]).

[TABLE]
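To make the "degrees of freedom" reading concrete, the Gaussian width of a sparse set can be estimated by Monte Carlo. The sketch below is only an illustration under stated assumptions: it uses the set of $s$-sparse vectors in the unit ball as a stand-in for the tangent ball, for which the supremum $\sup_{\bm{v}}\bm{v}^{T}\bm{g}$ equals the $\ell_{2}$ norm of the $s$ largest-magnitude entries of $\bm{g}$. The function name `sparse_gaussian_width` is hypothetical.

```python
import numpy as np

def sparse_gaussian_width(p, s, trials=2000, seed=0):
    """Monte Carlo estimate of E[sup v^T g] over s-sparse v with ||v||_2 <= 1."""
    rng = np.random.default_rng(seed)
    g = np.abs(rng.standard_normal((trials, p)))
    top = np.sort(g, axis=1)[:, -s:]           # s largest magnitudes in each draw
    return np.linalg.norm(top, axis=1).mean()  # sup is attained by aligning v with them

# Compare the squared width with the familiar s*log(p/s) scaling (up to constants).
w = sparse_gaussian_width(p=800, s=20)
print(w**2, 20 * np.log(800 / 20))
```

The squared width is far below the ambient dimension $p=800$, which is what makes estimation with $n\ll p$ samples plausible for sparse parameters.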

The next statistical quantity required in our analysis is the Orlicz norm, defined as follows.

Definition II.2 (Orlicz norms)

For a scalar random variable $X$, the Orlicz- $a$ norm is defined as

$$\|X\|_{\psi_{a}}=\sup_{p\geq 1}p^{-1/a}(\operatorname{\mathbb{E}}[|X|^{p}])^{1/p}$$

Orlicz- $a$ norm of a vector $\bm{x}\in\mathbb{R}^{d}$ is defined as $\|\bm{x}\|_{\psi_{a}}=\sup_{\bm{v}\in\mathcal{B}^{d}}\|\bm{v}^{T}\bm{x}\|_{\psi_{a}}$ . Subexponential and subgaussian norms are special cases of Orlicz- $a$ norm given by $\|{\cdot}\|_{\psi_{1}}$ and $\|{\cdot}\|_{\psi_{2}}$ respectively.
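As a small numerical illustration of the definition, the quantity $p^{-1/a}(\operatorname{\mathbb{E}}[|X|^{p}])^{1/p}$ can be evaluated exactly for an exponential random variable, since $\operatorname{\mathbb{E}}[X^{p}]=\Gamma(p+1)$ for $X\sim\text{Exp}(1)$. The sketch below uses an uncentered $\text{Exp}(1)$ variable and a finite grid of $p$ values, both chosen here for convenience; the function name is hypothetical.

```python
import math

def orlicz_profile_exp1(a, p_grid):
    """Evaluate p^(-1/a) * (E|X|^p)^(1/p) for X ~ Exp(1), using E[X^p] = Gamma(p + 1)."""
    return [p ** (-1.0 / a) * math.exp(math.lgamma(p + 1) / p) for p in p_grid]

grid = [1, 2, 4, 8, 16, 32, 64]
psi1 = orlicz_profile_exp1(1, grid)  # stays bounded on the grid: subexponential
psi2 = orlicz_profile_exp1(2, grid)  # keeps growing with p: not subgaussian
print(max(psi1), psi2[-1])
```

On this grid the $\psi_{1}$ profile is largest at $p=1$ and decreases, while the $\psi_{2}$ profile grows roughly like $\sqrt{p}$, which is why exponential data falls under the subexponential rather than the subgaussian results.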

Based on perturbed Gaussian width definition, we will show that one can upper bound the critical quantities (10) and (11). In return, this will reveal the statistical and computational performance of the PGD algorithm. This is the topic of the next section which states our main results.

III Main Results

In this section we estimate the convergence rate and the statistical accuracy of the PGD algorithm as a function of sample size, complexity of the parameter (e.g. sparsity level), and the distribution of the data (whether subgaussian or subexponential). Our main theorem establishes a linear convergence rate of PGD and shows that PGD achieves statistically efficient error rates. We first describe the data model.

Definition III.1 (Isotropic vector)

$\bm{x}\in\mathbb{R}^{p}$ is called an isotropic Orlicz- $a$ vector if it is zero-mean with identity covariance and if its Orlicz- $a$ norm $\|\bm{x}\|_{\psi_{a}}$ is bounded by an absolute constant.

Definition III.2 ( $\sigma$ -noisy datasets)

Consider the samples $(y_{i},\bm{x}_{i})_{i=1}^{n}$. We call a dataset $\sigma$ -Orlicz- $a$ if the input samples are isotropic Orlicz- $a$ vectors and the residual at the ground truth obeys

$$\|y-\bm{x}^{T}\bm{\theta}^{\star}-\mu^{\star}\|_{\psi_{a}}\leq\sigma.$$

We call a $\sigma$ -Orlicz- $1$ dataset $\sigma$ -subexponential and a $\sigma$ -Orlicz- $2$ dataset $\sigma$ -subgaussian.

Note that the residual at the ground truth is the noise in our problem, which may be a function of the nonlinearity. Our main results capture the PGD performance for different dataset models.

Theorem III.3 (Subgaussian)

Suppose $(\bm{x}_{i},y_{i})_{i=1}^{n}$ is a $\sigma$ -subgaussian dataset. Assume $n\gtrsim{(\omega(\mathcal{C})+t)^{2}}$ and set the learning rate $\eta=1/n$. Let $\mathcal{R}$ be an arbitrary regularizer. Starting from any initial estimate $[\bm{\theta}_{0}^{T}~\mu_{0}]^{T}$, with probability at least $1-6\exp(-c_{0}t^{2}/2)-4n^{-100}$, all PGD iterates (6) obey

[TABLE]

Similarly, for subexponential samples, we have the following theorem which applies to convex regularizers.

Theorem III.4 (Subexponential)

Suppose $(\bm{x}_{i},y_{i})_{i=1}^{n}$ is a $\sigma$ -subexponential dataset. Set $q=(n+p)\log^{3}(n+p)$ . Set learning rate $\eta={c_{0}/q}$ , suppose $\mathcal{R}$ is convex and $n\gtrsim{(\omega_{n}(\mathcal{C})+t)^{2}}$ . Starting from initialization $[\bm{\theta}_{0}^{T}~{}\mu_{0}]^{T}$ , with probability at least $1-9\exp(-c_{0}{\min(n,t\sqrt{n},t^{2})})-5(n+p)^{-100}$ , all PGD iterates (6) obey

[TABLE]

Both of these results show that PGD iterates converge to the population parameters $\bm{\theta}^{\star},~\mu^{\star}$ at a linear rate. The subexponential theorem requires a more conservative choice of learning rate. The statistical estimation error grows as $\omega(\mathcal{C})/\sqrt{n}$ for subgaussian and $\omega_{n}(\mathcal{C})/\sqrt{n}$ for subexponential samples. Since our results apply in the regime $n\gtrsim\omega^{2}(\mathcal{C})$, following (12), the statistical errors associated with subgaussian and subexponential samples are the same up to a constant for typical regularizers.

Our main results follow from Theorems III.5 and III.6 which are the topics of the following sections.

III-A Controlling the Convergence Rate of PGD

In this section, we study the convergence rate characterized by the $\rho(\mathcal{C})$ term. The challenges we address are (i) characterizing the restricted singular values of the subexponential data matrices and (ii) addressing the concatenated all ones vector.

Theorem III.5 (Convergence rate)

Suppose $(\bm{x}_{i},y_{i})_{i=1}^{n}$ is a $\sigma$ -subgaussian dataset and $[{\bm{X}}~\mathbf{1}]$ is the modified data matrix, where $\mathbf{1}$ is a vector of all ones. Let $\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be the tangent balls as defined in (7) and (8) respectively. Assume $n\gtrsim{(\omega(\mathcal{C})+t)^{2}}$. Setting $\eta=1/n$, with probability at least $1-4e^{-t^{2}}$ we have

$$\rho(\mathcal{C})\lesssim\frac{\omega(\mathcal{C})+t}{\sqrt{n}}.$$

If the dataset is $\sigma$ -subexponential, then setting $\eta=c_{0}/((n+p)\log^{3}(n+p))$ and assuming $n\gtrsim(\omega_{n}(\mathcal{C})+t)^{2}$, with probability at least $1-5\exp(-c\min(n,t\sqrt{n},t^{2}))-3(n+p)^{-100}$, we have

$$\rho(\mathcal{C})\leq 1-C_{0}\eta n.$$

Note that the subexponential case requires a smaller learning rate, which results in slower convergence.

III-B Bounding the Error due to Nonlinearity

Next, we provide a bound on the effective noise level $\nu(\mathcal{C})$, which is crucial for assessing statistical accuracy. This term arises from the nonlinearity and noise associated with the relation between input and output. For example, for single-index models, we have $\operatorname{\mathbb{E}}[y~\big|~\bm{x}]=\phi(\bm{x}^{T}\bm{\theta}_{\text{GT}})$ for some link function $\phi$ and ground truth $\bm{\theta}_{\text{GT}}$, and $\phi$ becomes the source of the nonlinearity. Our approach is similar to [4, 5, 27, 6, 7] and treats the nonlinearity as noise. The finite sample noise is captured by the residual vector

$$\bm{w}=\bm{y}-{\bm{X}}\bm{\theta}^{\star}-\mathbf{1}\mu^{\star}.$$

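Following the $\nu(\mathcal{C})$ term in (11), the contribution of the residual $\bm{w}$ to the estimated parameter is captured by the vector

$$\bm{e}=[{\bm{X}}~\mathbf{1}]^{T}\bm{w}=\sum_{i=1}^{n}(y_{i}-\mu^{\star}-\bm{x}_{i}^{T}\bm{\theta}^{\star})\begin{bmatrix}\bm{x}_{i}\\ 1\end{bmatrix}.$$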

Our key observation is that the properties of $\bm{e}$ can be characterized under fairly general assumptions compared to the existing literature, which is mostly restricted to zero-mean subgaussian samples.

Theorem III.6 (Statistical error)

Suppose $(\bm{x}_{i},y_{i})_{i=1}^{n}\sim(\bm{x},y)$ is a $\sigma$ -subgaussian dataset. Let the tangent balls $\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in (7) and (8) respectively. Assume $n\gtrsim{(\omega(\mathcal{C})+t)^{2}}$. Then, with probability at least $1-2\exp(-t^{2}/2)-4n^{-100}$, we have

$$\frac{\nu(\mathcal{C})}{n}\lesssim\frac{\sigma(\omega(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

where $\nu(\mathcal{C})$ is the effective noise given by (11). If $(\bm{x}_{i},y_{i})_{i=1}^{n}$ is a $\sigma$ -subexponential dataset and $n\gtrsim{(\omega_{n}(\mathcal{C})+t)^{2}}$ , with probability at least $1-4\exp(-c~{}{\min(t\sqrt{n},t^{2})})-2n^{-100}$ , we have

$$\frac{\nu(\mathcal{C})}{n}\lesssim\frac{\sigma(\omega_{n}(\mathcal{C})+t)\log(n)}{\sqrt{n}}.$$

This theorem establishes the crucial finite sample upper bounds on $\nu(\mathcal{C})$ for both subgaussian and subexponential data as a function of the Gaussian width of the tangent ball. Combining our bounds on $\rho(\mathcal{C})$ and $\nu(\mathcal{C})$ and utilizing the recursion (9), we can obtain the PGD convergence characteristics and prove the main theorems.

IV Numerical Experiments

In this section, we discuss experiments that corroborate our theoretical results. We consider a standard single-index model where, for some ground truth vector $\bm{\beta}$ and link function $\phi$, the input/output relation is given by $y_{i}=\phi(\bm{\beta}^{T}\bm{x}_{i})$. We pick $\bm{\beta}$ to be a sparse vector with $s=20$ nonzeros and $p=800$ and set the sample size to $n=500$. Because of the sparsity prior, we run PGD as iterative hard thresholding where $\bm{\theta}_{\tau}$ is projected to be $s$ -sparse after every iteration. As link functions, we consider ReLU (i.e. $\max(x,0)$) and sign (which maps to $\pm 1$) functions, which are of interest for deep learning and quantization respectively. We generate the $\bm{x}_{i}$ 's with i.i.d. exponentially distributed entries (with parameter $\lambda=1$) and then remove the mean and normalize the covariance to identity. We pick a learning rate of $\eta=1/(5n)$ in all experiments. The shaded areas in the plots correspond to one standard deviation.
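The data model described above can be sketched as follows. This is a minimal sketch under assumptions that the text does not specify: the nonzero entries of the ground truth are drawn as Gaussians and normalized to unit norm, and whitening uses the population mean and variance of $\text{Exp}(1)$ (both equal to 1) rather than empirical estimates. The function name `make_single_index_data` is hypothetical.

```python
import numpy as np

def make_single_index_data(n, p, s, link, seed=0):
    """Draw whitened exponential inputs and labels y_i = link(beta^T x_i)."""
    rng = np.random.default_rng(seed)
    # Exp(1) entries have mean 1 and variance 1, so subtracting 1 whitens them.
    X = rng.exponential(scale=1.0, size=(n, p)) - 1.0
    beta = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta[support] = rng.standard_normal(s)  # assumed: Gaussian nonzero entries
    beta /= np.linalg.norm(beta)            # assumed: unit-norm ground truth
    y = link(X @ beta)
    return X, y, beta

def relu(u):
    return np.maximum(u, 0.0)

X, y, beta = make_single_index_data(n=500, p=800, s=20, link=relu)
print(X.shape, y.mean())
```

Passing `np.sign` as `link` gives the quantized labels used for the sign experiments.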

To assess test and training performance of PGD, we use the following three metrics:

• the normalized training error defined as $\|{\bm{y}-{\bm{X}}\bm{\theta}_{\tau}-\mu_{\tau}\mathbf{1}}\|_{\ell_{2}}^{2}/\|{\bm{y}}\|_{\ell_{2}}^{2}$,

• the normalized test error that is similarly defined but evaluated on a fresh dataset of size $n$ using the training model $\bm{\theta}_{\tau}$,

• correlation to the ground truth vector $\bm{\beta}$ defined as $\frac{\bm{\theta}_{\tau}^{T}\bm{\beta}}{\|{\bm{\theta}_{\tau}}\|_{\ell_{2}}\|{\bm{\beta}}\|_{\ell_{2}}}$.
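The metrics above translate directly into code. The helper names below are hypothetical; the test error is obtained by calling `normalized_error` on a fresh dataset with the trained $\bm{\theta}_{\tau}$ and $\mu_{\tau}$.

```python
import numpy as np

def normalized_error(X, y, theta, mu):
    """||y - X theta - mu 1||^2 / ||y||^2 on the given dataset."""
    r = y - X @ theta - mu
    return np.sum(r**2) / np.sum(y**2)

def correlation(theta, beta):
    """Cosine of the angle between the estimate and the ground truth direction."""
    return theta @ beta / (np.linalg.norm(theta) * np.linalg.norm(beta))
```

Because the correlation is scale-invariant, it measures only how well the direction of the ground truth is recovered, which is the relevant notion when the link function distorts the scale of the labels.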

We compare against two baselines. The first runs PGD with ${\bm{X}}$ and with $[{\bm{X}}~\mathbf{1}]$ separately. The second assumes knowledge of the ground truth $\bm{\beta}$ and fits a model $\gamma\bm{\beta}$ by finding $\gamma$ to minimize the training loss. Numerically, we minimize $\|{\bar{\bm{y}}-\gamma{\bm{X}}\bm{\beta}}\|_{\ell_{2}}^{2}$ over $\gamma$ where $\bar{y}_{i}=y_{i}-(1/n)\sum_{j=1}^{n}y_{j}$. This sets $\gamma=\bar{\bm{y}}^{T}{\bm{X}}\bm{\beta}/\|{{\bm{X}}\bm{\beta}}\|_{\ell_{2}}^{2}$.
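The second baseline is a one-dimensional least-squares fit with a closed form. A minimal sketch (the function name `scaled_ground_truth_fit` is hypothetical):

```python
import numpy as np

def scaled_ground_truth_fit(X, y, beta):
    """Return gamma minimizing ||y_bar - gamma * X beta||^2, with y_bar the centered labels."""
    y_bar = y - y.mean()          # debiased labels
    u = X @ beta                  # predictions of the unscaled ground truth
    return (y_bar @ u) / (u @ u)  # closed-form scalar least-squares solution
```

Setting the derivative of the quadratic in $\gamma$ to zero gives exactly this ratio, so no iterative solver is needed for the baseline.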

Figure 1 plots the loss as a function of the PGD iterations $\tau$. Both training and test errors decay gracefully with more iterations for both choices of link function. The dashed values correspond to $\gamma\bm{\beta}$ 's performance. While there is a slight mismatch between train/test performances (due to finite samples), high-dimensional estimation via PGD works well and performs on par with the ground truth. Observe that for ReLU, $\operatorname{\mathbb{E}}[y]$ is nonzero and estimating the mean should be beneficial. Indeed, Figures 1(c) and 1(d) demonstrate that $[{\bm{X}}~\mathbf{1}]$ substantially outperforms using ${\bm{X}}$ alone. There is no improvement for the sign function since $\operatorname{\mathbb{E}}[y]\approx 0$.

In Figure 2 we focus on the parameter estimation question by plotting the correlation between $\bm{\theta}_{\tau}$ and $\bm{\beta}$. The correlation is always between $-1$ and $1$ and quantifies how well we can estimate the direction of the ground truth vector via PGD. This experiment is conducted with two values of $n$, namely $250$ and $500$, while $p=800$ in both cases. Observe that a larger sample size results in more stable estimation (smaller standard deviations) and higher correlation with output. Additionally, Figure 2(d) shows that the ReLU problem achieves better correlation once we account for the bias term. Hence, mean estimation is beneficial not only for test performance but also for parameter estimation.

V Proofs of Main Theorems

This section proves our main results and outlines the proofs of Theorems III.3, III.4, III.5 and III.6. Throughout, we use the same notation as described in Section II.

V-A Proof of Theorem III.4

We provide our analysis for subexponential samples. The extension to subgaussian samples is accomplished in an identical fashion. Set the estimation error at iteration $\tau$ to be $\bm{h}_{\tau}=[{\bm{\theta}_{\tau}}^{T}~{}\mu_{\tau}]^{T}-[{\bm{\theta}^{\star}}^{T}~{}\mu^{\star}]^{T}$ . Note that, when $\rho(\mathcal{C})<1$ and $\mathcal{R}$ is a convex regularizer, then the recursion (9) can be iteratively expanded as

[TABLE]

With the advertised probability, subexponential statements of Theorems III.5 and III.6 hold. Hence, for some constants, we have that $\rho(\mathcal{C})\leq 1-c_{0}{\eta n}$ , $\nu(\mathcal{C})\leq C\sqrt{n}{\sigma(\omega_{n}(\mathcal{C})+t)\log(n)}$ and $\eta={c}/{q}$ with $q=(n+p)\log^{3}(n+p)$ . Plugging these in (15), we find the following upper bound on the right hand side,

[TABLE]

which is the desired bound. The case of subgaussian samples is again a corollary of Theorems III.5 and III.6. This concludes the proof of our main result.
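As a worked sketch of the step from the recursion (9) to the final bound (convex case, $\kappa=1$, with $c_{0},c,C$ the constants named above), unrolling the recursion gives a geometric series:

$$\|\bm{h}_{\tau}\|_{\ell_{2}}\leq\rho(\mathcal{C})^{\tau}\|\bm{h}_{0}\|_{\ell_{2}}+\eta\nu(\mathcal{C})\sum_{k=0}^{\tau-1}\rho(\mathcal{C})^{k}\leq\rho(\mathcal{C})^{\tau}\|\bm{h}_{0}\|_{\ell_{2}}+\frac{\eta\nu(\mathcal{C})}{1-\rho(\mathcal{C})}.$$

With $\rho(\mathcal{C})\leq 1-c_{0}\eta n$ we have $1-\rho(\mathcal{C})\geq c_{0}\eta n$, so the step size cancels in the residual term:

$$\frac{\eta\nu(\mathcal{C})}{1-\rho(\mathcal{C})}\leq\frac{\nu(\mathcal{C})}{c_{0}n}\leq\frac{C}{c_{0}}\,\sigma\frac{(\omega_{n}(\mathcal{C})+t)\log(n)}{\sqrt{n}},\qquad\rho(\mathcal{C})^{\tau}\leq\Big(1-\frac{c_{0}cn}{q}\Big)^{\tau}.$$

The first term decays linearly in $\tau$, and the second matches the statistical error rate $\omega_{n}(\mathcal{C})/\sqrt{n}$ discussed after the main theorems.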

V-B Proof of Theorem III.5 for subgaussian samples

We start our proof with the following lemma.

Lemma V.1

Let $(\bm{x}_{i})_{i=1}^{n}\sim\bm{x}\in\mathbb{R}^{p}$ be i.i.d. isotropic subgaussian samples. Let ${\bm{X}}\in\mathbb{R}^{n\times p}$ be the concatenated data and $[{\bm{X}}~\mathbf{1}]$ be the modified data matrix, where $\mathbf{1}$ is a vector of all ones. Let $\mathcal{T}$ be a closed set with Euclidean radius bounded by a constant and

[TABLE]

where $\beta\leq C_{1},~{}\gamma\leq C_{2}$ for some positive constants $C_{1},C_{2}$ and $\bm{v}\in\mathcal{T}$ . Assume $n\gtrsim(\omega(\mathcal{T})+t)^{2}$ . Then, with probability at least $1-2e^{-t^{2}}$ we have

[TABLE]

The proof of Lemma V.1 is deferred to Section VII-A. Next using the result of Lemma V.1, we obtain the following lemma which bounds the convergence rate for subgaussian samples.

Lemma V.2

Consider the setup of Lemma V.1. Furthermore, let the tangent balls $\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in (7) and (8) respectively. Following Lemma V.1, with probability at least $1-4e^{-t^{2}}$ , the following holds

[TABLE]

The proof of Lemma V.2 is deferred to Section VII-B. This completes the proof for subgaussian samples.

V-C Proof of Theorem III.5 for subexponential samples

Let $(\bm{x}_{i})_{i=1}^{n}\sim\bm{x}\in\mathbb{R}^{p}$ be i.i.d. isotropic subexponential vectors and let ${\bm{X}}$ be the associated design matrix as before. Let $\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in (7) and (8) respectively. Assume $n\gtrsim\omega_{n}^{2}(\mathcal{C})$ . Our proof strategy is based on the observation that we can bound the (restricted) singular values of $[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]$ with high probability for subexponential data as follows.

V-C1 Upper bounding the singular values

In this section, we upper bound the largest eigenvalue of the matrix $[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]$ with high probability. Towards this goal, we utilize the Matrix Chernoff bound from [32].

Theorem V.3 (Matrix Chernoff [32])

Consider a finite sequence $\{\bm{X}_{i}\}_{i=1}^{n}$ of independent, random, Hermitian matrices with common dimension $d$ . Assume that

[TABLE]

Define the sum $\bm{M}=\sum_{i=1}^{n}\bm{X}_{i}$ and let $\zeta_{\max}$ be an upper bound on the spectral norm of the expectation $\operatorname{\mathbb{E}}[\bm{M}]$ i.e. $\zeta_{\max}\geq\|\operatorname{\mathbb{E}}[\bm{M}]\|=\|\sum_{i=1}^{n}\operatorname{\mathbb{E}}[\bm{X}_{i}]\|$ . We have that

[TABLE]

We will use Theorem V.3 to bound the largest eigenvalue of $[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]$ . Observe that

[TABLE]

Clearly this matrix is positive semidefinite. To bound $\|[\bm{x}_{i}^{T}~{}1]^{T}[\bm{x}_{i}^{T}~{}1]\|$ , we use the following lemma.

Lemma V.4 (Spectral norm bound)

Let $(\bm{x}_{i})_{i=1}^{n}$ be i.i.d. isotropic subexponential samples in $\mathbb{R}^{p}$ . Then, with probability at least $1-2(n+p)^{-100}$ the spectral norm of all $\bm{x}_{i}\bm{x}_{i}^{T}$ matrices can be bounded as

[TABLE]

The proof of Lemma V.4 is deferred to Section VII-C. Lemma V.4 guarantees that $\|[\bm{x}_{i}^{T}~{}1]^{T}[\bm{x}_{i}^{T}~{}1]\|\leq\|{[\bm{x}_{i}^{T}~{}1]^{T}}\|_{\ell_{2}}^{2}=\|{\bm{x}_{i}}\|_{\ell_{2}}^{2}+1\leq Cp\log^{2}(n+p)$ . Hence, we do satisfy the conditions required by Theorem V.3. Before using Theorem V.3, we will upper bound the spectral norm of the expectation $\operatorname{\mathbb{E}}[[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]]$ as follows.

Lemma V.5 (Spectral norm bound of expectation)

Let $\bm{x}\in\mathbb{R}^{p}$ be an isotropic subexponential vector, $\tilde{\bm{x}}=[\bm{x}^{T}~{}1]^{T}$ and let $B=Cp\log^{2}(n+p)$ for sufficiently large constant $C>0$ . Then we have

[TABLE]

The proof of Lemma V.5 is deferred to Section VII-D. Thus, applying Lemma V.5 on the set of all $[\bm{x}_{i}^{T}~{}1]^{T}$ satisfying $\|[\bm{x}_{i}^{T}~{}1]^{T}[\bm{x}_{i}^{T}~{}1]\|\leq Cp\log^{2}(n+p)$ , we find that with probability $1-2(n+p)^{-100}$ the following holds

[TABLE]

Hence, we can pick $\zeta_{\max}\geq 2n$ to upper bound the largest eigenvalue of $\operatorname{\mathbb{E}}[[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]]$ . Now, using Theorem V.3 with $\zeta_{\max}=C_{0}C(n+p)\log^{3}(n+p),L=Cp\log^{2}(n+p)$ and $\epsilon=e-1$ we get

[TABLE]

Union bounding, with probability at least $1-3(n+p)^{-100}$ ,

[TABLE]
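To see what the choice $\epsilon=e-1$ buys, the following worked computation may help. It assumes Theorem V.3 takes the standard form of the Matrix Chernoff upper tail from [32], with $L$ bounding the spectral norm of each summand and $d=p+1$ ; the threshold $C_{0}\geq 101$ is illustrative.

```latex
% Assumed form of Theorem V.3 (standard Matrix Chernoff upper tail):
\mathbb{P}\big(\lambda_{\max}(\bm{M})\geq(1+\epsilon)\zeta_{\max}\big)
  \leq d\left[\frac{e^{\epsilon}}{(1+\epsilon)^{1+\epsilon}}\right]^{\zeta_{\max}/L}

% With \epsilon = e-1 we have 1+\epsilon = e, so the bracket equals e^{e-1}/e^{e} = e^{-1}.
% The exponent is
\frac{\zeta_{\max}}{L}
  =\frac{C_{0}C(n+p)\log^{3}(n+p)}{Cp\log^{2}(n+p)}
  =C_{0}\,\frac{n+p}{p}\,\log(n+p)\geq C_{0}\log(n+p)

% Hence, with d = p+1 \leq n+p,
\mathbb{P}\big(\lambda_{\max}(\bm{M})\geq e\,\zeta_{\max}\big)
  \leq(p+1)\,e^{-\zeta_{\max}/L}
  \leq(p+1)(n+p)^{-C_{0}}
  \leq(n+p)^{-100}\quad\text{for }C_{0}\geq 101 .
```

Adding the $2(n+p)^{-100}$ failure probability of the event in Lemma V.4 is consistent with the stated success probability $1-3(n+p)^{-100}$ .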

V-C2 Lower bounding the singular values

In this section we will lower bound the gain of $[{\bm{X}}~{}\mathbf{1}]$ restricted to the tangent ball $\mathcal{C}_{\text{ext}}$ . We will utilize the notion of restricted singular value (RSV) to proceed.

Definition V.6 (Restricted singular value)

Given a matrix $\bm{M}$ and a closed set $\mathcal{C}$ , the RSV of ${\bm{M}}$ at $\mathcal{C}$ is defined as

[TABLE]

In the following, we will lower bound $\min_{\tilde{\bm{v}}\in\mathcal{C}_{\text{ext}}}\|{[{\bm{X}}~{}\mathbf{1}]\tilde{\bm{v}}}\|_{\ell_{2}}$ which is the RSV of $[{\bm{X}}~{}\mathbf{1}]$ at $\mathcal{C}_{\text{ext}}$ . Recall that any $\tilde{\bm{v}}\in\mathcal{C}_{\text{ext}}$ with unit Euclidean norm can be written as $\tilde{\bm{v}}=[\sqrt{1-\gamma^{2}}\bm{v}^{T}~{}\gamma]^{T}$ for $|\gamma|\leq 1$ and $\bm{v}\in\mathcal{C}$ . Consequently

[TABLE]

Setting $\bar{\bm{x}}=\frac{1}{n}\sum_{i=1}^{n}\bm{x}_{i}$ and minimizing both sides over $\tilde{\bm{v}}\in\mathcal{C}_{\text{ext}}$ , we get

[TABLE]

In essence, (18) bounds the RSV of $[{\bm{X}}~{}\mathbf{1}]$ in terms of the RSV of ${\bm{X}}$ and some simpler terms. The following theorem from [29] (Theorem D.11) gives a lower bound on the RSV of a matrix ${\bm{X}}$ with i.i.d. subexponential rows.

Theorem V.7 (Bounding RSV [29])

Let ${\bm{X}}\in\mathbb{R}^{n\times d}$ be a random matrix with i.i.d. isotropic subexponential rows. Let $\mathcal{C}$ be a tangent ball as in (7) and suppose the sample size obeys $n\gtrsim(\omega_{n}(\mathcal{C})+t)$ . Then with probability at least $1-3\exp(-c\min(n,t\sqrt{n},t^{2}))$ , we have that

[TABLE]

Next, we shall state a lemma from [29] (Lemma D.7) to upper bound the term involving the sample average $\bar{\bm{x}}$ .

Lemma V.8 (Bounding empirical width [29])

Suppose $\mathcal{C}$ is a subset of the unit Euclidean ball and $(\bm{x}_{i})_{i=1}^{n}$ are i.i.d. zero-mean vectors with bounded subexponential norm. Define the empirical average vector $\bar{\bm{x}}=\frac{1}{n}\sum_{i}\bm{x}_{i}$ . We have that

[TABLE]

Combining Theorem V.7 and Lemma V.8 with (18), we find that there exist constants $c,c_{0},C_{0}>0$ such that with probability at least $1-5\exp(-c\min(n,t\sqrt{n},t^{2}))$ , we can lower bound the RSV of $[{\bm{X}}~{}\mathbf{1}]$ as

[TABLE]

where the last line follows from the assumption that $n\gtrsim(\omega_{n}(\mathcal{C})+t)^{2}$ .

V-C3 Upper bounding the convergence rate

Union bounding the events (17) and (19), we obtain upper and lower bounds on the singular values of $[{\bm{X}}~{}\mathbf{1}]$ with the desired probability. Hence, we can bound the convergence rate of PGD as follows. Setting $q=(n+p)\log^{3}(n+p)$ , (17) gives $\|[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]\|\leq Cq$ . Therefore, choosing the learning rate $\eta=1/(Cq)$ , the matrix ${\bm{I}}-\eta[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]$ is positive semidefinite (PSD). Hence, applying the generalized Cauchy-Schwarz inequality for PSD matrices, we find

[TABLE]

Here the last inequality follows from (19). This completes the proof for subexponential samples.
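The two facts used in this step, the spectral norm bound by a multiple of $q$ and the positive semidefiniteness of ${\bm{I}}-\eta[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}]$ , can be checked on synthetic data. The sketch below is only a sanity check under assumptions: Laplace entries stand in for a subexponential design, the constant is taken as $C=1$ , and the sizes mirror one of the experimental settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 250, 800
# Laplace entries with scale 1/sqrt(2) have unit variance and subexponential tails.
X = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(n, p))
A = np.hstack([X, np.ones((n, 1))])               # modified data matrix [X 1]

gram = A.T @ A
gram_norm = np.linalg.norm(A, 2) ** 2             # spectral norm of [X 1]^T [X 1]
q = (n + p) * np.log(n + p) ** 3                  # q = (n + p) log^3(n + p)

eta = 1.0 / q                                     # assumption: C = 1 in eta = 1 / (C q)
eigs = np.linalg.eigvalsh(np.eye(p + 1) - eta * gram)
min_eig = float(eigs.min())                       # equals 1 - eta * gram_norm up to rounding
```

For these sizes the spectral norm sits far below $q$ , which suggests that the theoretical step size is conservative in practice.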

V-D Proof of Theorem III.6 for subgaussian samples

Suppose the dataset $(\bm{x}_{i},y_{i})_{i=1}^{n}\sim(\bm{x},y)$ is $\sigma$ -subgaussian. Let ${\bm{X}},[{\bm{X}}~{}\mathbf{1}],\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in Section II, recall $\bm{w}$ from (13) and assume $n\gtrsim(\omega(\mathcal{C})+t)^{2}$ . Representing $\tilde{\bm{v}}\in\mathcal{C}_{\text{ext}}$ as $\tilde{\bm{v}}=[\sqrt{1-\gamma^{2}}\bm{v}^{T}~{}\gamma]^{T}$ for $\bm{v}\in\mathcal{C}$ and $|\gamma|\leq 1$ , we have

[TABLE]

In the following we will upper bound the terms $\sup_{\bm{v}\in\mathcal{C}}|\bm{v}^{T}{\bm{X}}^{T}\bm{w}|$ and $|\mathbf{1}^{T}\bm{w}|$ separately and will combine them to get an upper bound on the residual error.

V-D1 Upper bounding the first term in (20)

In order to upper bound the first term in (20), define the clipping function

[TABLE]

The following lemma immediately follows from union bounding the large deviations of subgaussian and subexponential variables $X$ and shows that $X=\text{clip}(X,B)$ with high probability.

Lemma V.9

Let $(w_{i})_{i=1}^{n}$ be i.i.d. subgaussian random variables with $\|{w_{i}}\|_{\psi_{2}}\leq\sigma$ . There exists a constant $C>0$ such that picking $B=C\sqrt{\log(n)}$ , with probability $1-2n^{-100}$ for all $i$ , we have

[TABLE]

If instead $(w_{i})_{i=1}^{n}$ are i.i.d. subexponential with $\|{w_{i}}\|_{\psi_{1}}\leq\sigma$ , then picking $B=C\log(n)$ leads to the same result.
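The content of Lemma V.9 is easy to observe empirically. The sketch below assumes the clipping function saturates at $\pm B$ (the displayed definition is not reproduced here, so this form is an assumption), uses Gaussian $w_{i}$ as a stand-in for subgaussian noise, and picks an illustrative constant `C = 4`.

```python
import numpy as np

def clip(x, B):
    """Saturating clip: x when |x| <= B and sign(x) * B otherwise (assumed form)."""
    return np.clip(x, -B, B)

rng = np.random.default_rng(2)
n, sigma, C = 10_000, 2.0, 4.0                 # C is an illustrative constant
w = sigma * rng.standard_normal(n)             # Gaussian noise, subgaussian norm of order sigma
B = C * np.sqrt(np.log(n))                     # B = C sqrt(log n) as in the lemma

unchanged = bool(np.array_equal(clip(w, sigma * B), w))
max_ratio = float(np.abs(w).max() / (sigma * B))
```

Here `max_ratio` reports how close the largest sample comes to the clipping level; when it is below one, clipping at $\sigma B$ leaves every $w_{i}$ untouched, which is the event the proofs condition on.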

Using Lemma V.9, $\|\bm{w}\|_{\infty}\leq\sigma B$ with probability $1-2n^{-100}$ . Conditioned on this event, we have

[TABLE]

Setting ${\bm{z}}_{i}=\text{clip}(w_{i},\sigma B)\bm{x}_{i}=w_{i}\bm{x}_{i}$ , (21) can be rewritten as

[TABLE]

Note that ${\bm{z}}_{i}=w_{i}\bm{x}_{i}$ is subgaussian since $w_{i}$ is bounded. The subgaussian norm obeys

[TABLE]

Define the average vector $\bar{{\bm{z}}}=n^{-1/2}\sum_{i=1}^{n}({\bm{z}}_{i}-\operatorname{\mathbb{E}}[{\bm{z}}_{i}])$ which is still subgaussian with same norm (up to a constant). Standard results from functional analysis [28] guarantee

[TABLE]

with probability at least $1-2e^{-t^{2}/2}$ . This bounds the first term of (22). Next, we address the expectation term $\|{\operatorname{\mathbb{E}}[{\bm{z}}_{1}]}\|_{\ell_{2}}$ via the following lemma.

Lemma V.10

Suppose $\bm{x}$ is an isotropic Orlicz-a vector and $\|w\|_{\psi_{a}}\leq\sigma$ . Let $B=C\log^{1/a}(n)$ for sufficiently large constant $C>0$ . For $a=1,2$ , we have that

[TABLE]

The proof of Lemma V.10 is deferred to Section VII-F. Combining (23) and Lemma V.10 into (22), with probability at least $1-2e^{-t^{2}/2}-2n^{-100}$ , we find that,

[TABLE]

which is the desired bound for the first term in (20).

V-D2 Upper bounding the second term in (20)

The vector $\bm{w}$ is zero-mean with $\|{\bm{w}}\|_{\psi_{2}}\leq\sigma$ . Hence, $\|{{\mathbf{1}}^{T}\bm{w}}\|_{\psi_{2}}\leq\sigma\sqrt{n}$ which implies that with probability $1-2n^{-100}$ ,

[TABLE]

Combining the bound above with (24), we get the advertised bound on the residual, namely

[TABLE]

with probability at least $1-2\exp(-t^{2}/2)-4n^{-100}$ . This completes the proof for $\sigma$ -subgaussian data.

V-E Proof of Theorem III.6 for subexponential samples

Suppose the dataset $(\bm{x}_{i},y_{i})_{i=1}^{n}\sim(\bm{x},y)$ is $\sigma$ -subexponential. Let ${\bm{X}},[{\bm{X}}~{}\mathbf{1}],\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in Section II, recall $\bm{w}$ from (13) and assume $n\gtrsim(\omega_{n}(\mathcal{C})+t)^{2}$ . Similar to the subgaussian case, we split the residual into two terms via (20) and bound each term separately to get a final bound.

V-E1 Upper bounding the first term in (20)

Let ${\bm{z}}_{i}=w_{i}\bm{x}_{i}$ . With probability $1-2n^{-100}$ , we have that $\|\bm{w}\|_{\infty}\lesssim\sigma\log n$ . We continue the analysis conditioned on this event. With bounded $w_{i}$ , ${\bm{z}}_{i}-\operatorname{\mathbb{E}}[{\bm{z}}_{i}]$ is subexponential via

[TABLE]

Combining this with Lemma V.8 guarantees that

[TABLE]

with probability at least $1-2\exp(-{\cal{O}}(\min(t\sqrt{n},t^{2})))$ . Next, using Lemma V.10, we also upper bound $\|{\operatorname{\mathbb{E}}[{\bm{z}}_{1}]}\|_{\ell_{2}}$ by $C\sigma p^{2}n^{-201}$ . Combining this with (26) and substituting into the deterministic inequality (22), with probability at least $1-2\exp(-{\cal{O}}(\min(t\sqrt{n},t^{2})))-2n^{-100}$ we have

[TABLE]

V-E2 Upper bounding the second term in (20)

Using $\|{w_{i}}\|_{\psi_{1}}\lesssim\sigma$ and applying Lemma V.8 (over one-dimensional $\mathbb{R}$ ), we find that $|{\mathbf{1}}^{T}\bm{w}|\lesssim\sigma(1+t)\sqrt{n}$ with probability $1-2\exp(-c\,\min(t\sqrt{n},t^{2}))$ .

Combining this with (27) and plugging into (20), we get the advertised upper bound

[TABLE]

which holds with probability at least $1-4\exp(-c\,{\min(t\sqrt{n},t^{2})})-2n^{-100}$ . This completes the proof for $\sigma$ -subexponential data.

VI Conclusion

We studied the problem of finding the best linear model from $n$ input-output samples under quadratic loss in the high-dimensional regime $n\ll p$ . For estimation, we utilized the projected gradient descent algorithm and showed its fast convergence as well as statistical accuracy in a data-dependent fashion. Our results are established for subexponential designs, which are heavier tailed than the well-studied subgaussian designs. In both cases, we prove that the nonlinearity of the problem behaves like independent noise and we establish favorable statistical guarantees as if the problem were linear. We also modified the original regression problem to allow for mean estimation and demonstrated, via simulations, its practical benefit when output labels have nonzero mean.

It would be desirable to extend our results to general loss functions. If a loss function $\ell$ has the potential to better capture the input/output relation, we can solve for

[TABLE]

Specifically, this function can still be quadratic but characterized by a nonlinear link function $\phi$ , i.e., $\ell(y_{i},\left<\bm{\theta},\bm{x}_{i}\right>)=(y_{i}-\phi(\left<\bm{\theta},\bm{x}_{i}\right>))^{2}$ . We believe that many of the results presented here extend to strongly increasing $\phi$ whose derivative is lower bounded by a constant, i.e., $\phi^{\prime}\geq\alpha$ for some $\alpha>0$ . These functions are shown to behave like linear regression [33]. However, it is not immediately clear whether the strong statistical and computational guarantees established in this paper (as well as in the related literature) can be established for $\phi$ .
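To make the link-function loss concrete, the sketch below evaluates $\sum_{i}(y_{i}-\phi(\left<\bm{\theta},\bm{x}_{i}\right>))^{2}$ and its gradient for one strongly increasing link. The particular choice $\phi(z)=z+\tfrac{1}{2}\tanh(z)$ , for which $\phi^{\prime}\geq 1$ , and all function names are assumptions made for illustration; the paper does not analyze this case.

```python
import numpy as np

def phi(z):
    """Hypothetical strongly increasing link: phi'(z) = 1 + 0.5 / cosh(z)^2 >= 1."""
    return z + 0.5 * np.tanh(z)

def phi_prime(z):
    return 1.0 + 0.5 / np.cosh(z) ** 2

def loss(theta, X, y):
    r = y - phi(X @ theta)
    return float(r @ r)

def grad(theta, X, y):
    """Chain rule: d/dtheta sum_i (y_i - phi(z_i))^2 with z = X theta."""
    z = X @ theta
    return -2.0 * X.T @ ((y - phi(z)) * phi_prime(z))

rng = np.random.default_rng(4)
n, p = 20, 5
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = phi(X @ theta_true) + 0.1 * rng.standard_normal(n)

theta0 = np.zeros(p)
g = grad(theta0, X, y)
h = 1e-6                                         # central finite differences as a check
num_grad = np.array([(loss(theta0 + h * e, X, y) - loss(theta0 - h * e, X, y)) / (2 * h)
                     for e in np.eye(p)])
```

A gradient of this form could replace the least-squares gradient inside a PGD loop, although whether the guarantees carry over is exactly the open question raised above.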

VII Appendix

This section provides the proofs of supporting results.

VII-A Proof of Lemma V.1

We start by expanding the convergence term by substituting $\tilde{\bm{v}}=[\beta\bm{v}^{T}~{}\gamma]^{T}$ as follows,

[TABLE]

where, $\bar{\bm{x}}=n^{-1}\sum_{i=1}^{n}\bm{x}_{i}$ is the empirical average vector of i.i.d. subgaussian rows $(\bm{x}_{i})_{i=1}^{n}$ . Thus, using (29), we can write

[TABLE]

Given ${\bm{X}}\in\mathbb{R}^{n\times p}$ is isotropic subgaussian, Lemma 6.14 in [14] guarantees

[TABLE]

with probability at least $1-e^{-t^{2}}$ . Furthermore, since $(\bm{x}_{i})_{i=1}^{n}$ ’s have bounded subgaussian norm, $\bar{\bm{x}}$ is also bounded and standard results from functional analysis guarantee [28]

[TABLE]

with probability at least $1-e^{-t^{2}}$ . Combining the results (31) and (32) into (30), we find that

[TABLE]

holds with probability at least $1-2e^{-t^{2}}$ . This completes the proof of Lemma V.1.

VII-B Proof of Lemma V.2

Let the tangent balls $\mathcal{C}$ and $\mathcal{C}_{\text{ext}}$ be as defined in (7) and (8) respectively. Define the sets

[TABLE]

and note that,

[TABLE]

Similarly, $\omega(\mathcal{C}+\mathcal{C})\leq 2\omega(\mathcal{C})$ . Applying Lemma V.1 on $\mathcal{T}_{+}$ and $\mathcal{T}_{-}$ , with advertised probability, we have

[TABLE]

where $\Lambda(\bm{a},\bm{b})=\bm{a}^{T}({\bm{I}}-\frac{1}{n}[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}])\bm{b}$ . Now, for any ${\bm{u}},\bm{v}\in\mathcal{C}_{\text{ext}}$ , picking ${\bm{u}}+\bm{v}\in\mathcal{T}_{+}~{}\text{and}~{}{\bm{u}}-\bm{v}\in\mathcal{T}_{-}$ , we have

[TABLE]

To proceed, note that

[TABLE]

Hence, $|\Lambda({\bm{u}},\bm{v})|=|{\bm{u}}^{T}({\bm{I}}-\frac{1}{n}[{\bm{X}}~{}\mathbf{1}]^{T}[{\bm{X}}~{}\mathbf{1}])\bm{v}|\lesssim(\omega(\mathcal{C})+t)/{\sqrt{n}}$ holds with the advertised probability.
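The passage from quadratic forms on $\mathcal{T}_{+}$ and $\mathcal{T}_{-}$ to the bilinear form $\Lambda({\bm{u}},\bm{v})$ presumably relies on the polarization identity for a symmetric bilinear form. The displayed steps are not reproduced in this text, so that reading is an assumption; the sketch below simply verifies the identity numerically for the matrix defining $\Lambda$ , with arbitrary test vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 20
X = rng.standard_normal((n, p))
A = np.hstack([X, np.ones((n, 1))])              # modified data matrix [X 1]
M = np.eye(p + 1) - (A.T @ A) / n                # symmetric matrix defining Lambda

def Lam(a, b):
    """Lambda(a, b) = a^T (I - [X 1]^T [X 1] / n) b."""
    return float(a @ M @ b)

u = rng.standard_normal(p + 1)
v = rng.standard_normal(p + 1)
lhs = Lam(u, v)
# Polarization: Lambda(u, v) = (Lambda(u+v, u+v) - Lambda(u-v, u-v)) / 4 for symmetric M.
rhs = 0.25 * (Lam(u + v, u + v) - Lam(u - v, u - v))
```

Under this reading, bounding the quadratic form uniformly over sums and differences of elements is enough to control the cross term.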

VII-C Proof of Lemma V.4

Let $(\bm{x}_{i})_{i=1}^{n}\sim\bm{x}\in\mathbb{R}^{p}$ be i.i.d. isotropic subexponential samples and let ${\bm{X}}\in\mathbb{R}^{n\times p}$ be the concatenated design matrix. Let $x_{ij}$ denote the $(i,j)$ -th element of the matrix ${\bm{X}}$ . Since each $x_{ij}$ has subexponential norm bounded by a constant, there exists a constant $C>0$ such that $|x_{ij}|\leq C\log(n+p)$ holds with probability at least $1-2(n+p)^{-102}$ by the subexponential tail bound. Union bounding over all entries of ${\bm{X}}$ yields that $|x_{ij}|\leq C\log(n+p)$ holds for all $i,j$ with probability at least $1-2(n+p)^{-100}$ . Hence, we can bound each row $\bm{x}_{i}$ of ${\bm{X}}$ with probability at least $1-2(n+p)^{-100}$ via

[TABLE]

or equivalently, we have

[TABLE]

This completes the proof of Lemma V.4.
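For completeness, the arithmetic behind the union bound and the row-norm bound in this proof can be spelled out as follows. This is a sketch of the routine steps, with the constant $C^{2}$ renamed to $C$ at the end, as is customary.

```latex
% Union bound over the np entries of X, using np \leq (n+p)^{2}:
np\cdot 2(n+p)^{-102}\leq 2(n+p)^{2}(n+p)^{-102}=2(n+p)^{-100}

% On this event, every row satisfies
\|\bm{x}_{i}\|_{\ell_{2}}^{2}=\sum_{j=1}^{p}x_{ij}^{2}\leq p\,C^{2}\log^{2}(n+p)

% and, since \bm{x}_{i}\bm{x}_{i}^{T} has rank one,
\|\bm{x}_{i}\bm{x}_{i}^{T}\|=\|\bm{x}_{i}\|_{\ell_{2}}^{2}\leq C^{2}p\log^{2}(n+p)
```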

VII-D Proof of Lemma V.5

Recall that $(\bm{x}_{i})_{i=1}^{n}\sim\bm{x}\in\mathbb{R}^{p}$ are i.i.d. isotropic subexponential vectors and $\tilde{\bm{x}}=[\bm{x}^{T}~{}1]^{T}$ . We can estimate the covariance matrix of $\tilde{\bm{x}}$ given $\|{\tilde{\bm{x}}}\|_{\ell_{2}}^{2}\leq B$ using the law of total probability as follows

[TABLE]

Since a covariance matrix is positive-semidefinite, each term in (35) is individually positive semidefinite. Hence, we will drop the second term in (35) to get the following lower bound on the covariance matrix

[TABLE]

Using Lemma V.4, it follows that $\|{\tilde{\bm{x}}}\|_{\ell_{2}}^{2}=\|{[\bm{x}^{T}~{}1]^{T}}\|_{\ell_{2}}^{2}\leq Cp\log^{2}(n+p)=B$ holds with probability at least $1-2(n+p)^{-100}$ . Hence, following (36), we get

[TABLE]

This completes the proof of Lemma V.5.

VII-E Proof of Lemma V.9

Subgaussian case: Using subgaussian tail, for large enough constant $C>0$ , for each $i$ , we have $|w_{i}|\leq C\sigma\sqrt{\log(n)}=\sigma B$ with probability at least $1-2n^{-101}$ . This implies $\text{clip}(w_{i},\sigma B)=w_{i}$ . Union bounding over all entries of $\bm{w}$ , we find the result which holds with probability at least $1-2n^{-100}$ .

Subexponential case: follows similarly with $B=C\log(n)$ .

VII-F Proof of Lemma V.10

We prove the result for subexponential samples. The subgaussian case follows similarly. Without loss of generality, let $\sigma=1$ as everything can be scaled accordingly. Defining the clip function as previously, set ${\bm{z}}=\text{clip}(w,B)\bm{x}$ . Furthermore, let $w_{\text{tail}}$ denote the tail of $|w|$ , such that

[TABLE]

$w_{\text{tail}}$ is an upper bound on the error due to clipping, that is,

[TABLE]

We proceed by upper bounding $\|{\operatorname{\mathbb{E}}[{\bm{z}}]}\|_{\ell_{2}}$ in terms of $w_{\text{tail}}$ , using the subadditivity of the $\ell_{2}$ -norm and the orthogonality of $w$ and $\bm{x}$ (i.e., $\operatorname{\mathbb{E}}[w\bm{x}]=0$ ) as follows

[TABLE]

Using subexponentiality, for some constant $c>0$ , we have that, $\operatorname{\mathbb{P}}(w_{\text{tail}}>\sqrt{c}t)\leq 2e^{-t}$ and $\operatorname{\mathbb{P}}(\|{\bm{x}}\|_{\ell_{2}}>\sqrt{cp}t)\leq 2pe^{-t}$ , where, the latter follows from union bounding over all entries of $\bm{x}$ . Union bounding these two events, we get the following tail bound for their product,

[TABLE]

For notational convenience, set

[TABLE]

and note that $g$ satisfies the following property due to (37)

[TABLE]

Furthermore, from (40) we get the following tail distribution

[TABLE]

for $t\geq\alpha:=\sqrt{p}B^{2}$ . Combining (41), (42) and (43) into (39) and denoting probability density function of $g$ by $f_{g}$ , we get

[TABLE]

where, (a) follows from (43). To bound the term on the right hand side, we do a change of variable in (44) by setting $\tau=[t/(c\sqrt{p})]^{1/2}$ to get,

[TABLE]

Combining this with (44), we get

[TABLE]

where, we get (a) by picking $B=C\log(n)$ with sufficiently large $C>0$ . Finally, note that conditioned on $|w|\leq B$ , ${\bm{z}}=w\bm{x}$ and

[TABLE]

Since $\mathbb{P}(|w|\leq B)>1/2$ , this yields $\|{\operatorname{\mathbb{E}}[w\bm{x}{~{}\big{|}~{}}|w|\leq B]}\|_{\ell_{2}}\lesssim{p^{2}n^{-201}}$ which is the advertised result with $\sigma=1$ .

Similarly for subgaussian samples, one can show that

[TABLE]

Picking $B=C\sqrt{\log(n)}$ with sufficiently large $C>0$ , we get the same result, concluding the proof of Lemma V.10.

References

[1] Y. Plan, R. Vershynin, and E. Yudovina, “High-dimensional estimation with geometric constraints,” Information and Inference: A Journal of the IMA, vol. 6, no. 1, pp. 1–40, 2016.
[2] P. T. Boufounos and R. G. Baraniuk, “1-bit compressive sensing,” in Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on. IEEE, 2008, pp. 16–21.
[3] R. Ganti, N. Rao, R. M. Willett, and R. Nowak, “Learning single index models in high dimensions,” arXiv preprint arXiv:1506.08910, 2015.
[4] S. Oymak and M. Soltanolkotabi, “Fast and reliable parameter estimation from nonlinear observations,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2276–2300, 2017.
[5] Y. Plan, R. Vershynin, and E. Yudovina, “High-dimensional estimation with geometric constraints,” Information and Inference: A Journal of the IMA, vol. 6, no. 1, pp. 1–40, 2017.
[6] Y. Plan and R. Vershynin, “The generalized lasso with non-linear observations,” IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1528–1537, 2016.
[7] C. Thrampoulidis, E. Abbasi, and B. Hassibi, “Lasso with non-linear measurements is equivalent to one with linear measurements,” in Advances in Neural Information Processing Systems, 2015, pp. 3420–3428.
[8] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk, “Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors,” IEEE Transactions on Information Theory, vol. 59, no. 4, pp. 2082–2102, 2013.
