Do Subsampled Newton Methods Work for High-Dimensional Data?

Xiang Li; Shusen Wang; Zhihua Zhang

arXiv:1902.04952·stat.ML·May 7, 2019

TL;DR

This paper provides a theoretical justification for the effectiveness of subsampled Newton methods in high-dimensional settings, showing they require significantly fewer samples than previously thought, especially when leveraging ridge leverage scores.

Contribution

It proves that only a small number of samples based on ridge leverage scores are needed for Hessian approximation in high-dimensional data, extending applicability to distributed and non-smooth problems.

Findings

01

Subsampled Newton methods need only $\tilde{\Theta}(d_{\text{eff}}^{\gamma})$ samples for Hessian approximation.

02

The method is effective even when data dimension is large and comparable to data size.

03

Extensions to distributed and non-smooth regularized optimization problems are provided.

Abstract

Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices while still using sufficient curvature information. However, previous results require $\Omega(d)$ samples to approximate Hessians, where $d$ is the dimension of data points, making them less practical for high-dimensional data. The situation deteriorates when $d$ is comparable to the number of data points $n$ , which requires taking the whole dataset into account and makes subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on high dimensional data. Specifically, we prove only $\Theta(d_{eff}^{γ})$ samples are needed in the approximation of Hessian matrices, where $d_{eff}^{γ}$ is the $γ$ -ridge leverage and can be much smaller than $d$ as long as $n\gamma \gg…

Equations

\min_{{\bf w}\in{\mathbb{R}}^{d}} G({\bf w}) := \frac{1}{n}\sum_{j=1}^{n} l_{j}({\bf x}_{j}^{T}{\bf w}) + \frac{\gamma}{2}\|{\bf w}\|_{2}^{2} + r({\bf w})

\min_{{\bf w}\in{\mathbb{R}}^{d}} F({\bf w}) := \frac{1}{n}\sum_{j=1}^{n} l_{j}({\bf x}_{j}^{T}{\bf w}) + \frac{\gamma}{2}\|{\bf w}\|_{2}^{2}

{\bf w}_{t+1} = {\bf w}_{t} - \alpha_{t}{\bf H}_{t}^{-1}{\bf g}_{t}

{\bf H}_{t} = \frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t} + \gamma{\bf I}_{d}

\widetilde{\bf H}_{t} = \frac{1}{s}\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t} + \gamma{\bf I}_{d}

{\bf A} = {\bf U}\boldsymbol{\Sigma}{\bf V}^{T} = \sum_{i=1}^{d}\sigma_{i}{\bf u}_{i}{\bf v}_{i}^{T}

l_{j}^{\gamma} = {\bf a}_{j}^{T}({\bf A}^{T}{\bf A} + n\gamma{\bf I}_{d})^{\dagger}{\bf a}_{j} = \sum_{k=1}^{d}\frac{\sigma_{k}^{2}}{\sigma_{k}^{2}+n\gamma}u_{jk}^{2}

{d_{\text{eff}}^{\gamma}}({\bf A}) = \sum_{j=1}^{n} l_{j}^{\gamma} = \sum_{k=1}^{d}\frac{\sigma_{k}^{2}}{\sigma_{k}^{2}+n\gamma} \leq d

\mu^{\gamma} = \frac{n}{{d_{\text{eff}}^{\gamma}}}\max_{i\in[n]} l_{i}^{\gamma}

\mu^{0} = \frac{n}{d}\max_{j\in[n]} l_{j}^{0} = \frac{n}{d}\max_{j\in[n]} {\bf a}_{j}^{T}({\bf A}^{T}{\bf A})^{\dagger}{\bf a}_{j}

{\bf g}_{t} = \frac{1}{n}\sum_{j=1}^{n} l_{j}^{\prime}({\bf x}_{j}^{T}{\bf w}_{t})\cdot{\bf x}_{j} + \gamma{\bf w}_{t} \in {\mathbb{R}}^{d}

{\bf H}_{t} = \frac{1}{n}\sum_{j=1}^{n} l_{j}^{\prime\prime}({\bf x}_{j}^{T}{\bf w}_{t})\cdot{\bf x}_{j}{\bf x}_{j}^{T} + \gamma{\bf I}_{d} \in {\mathbb{R}}^{d\times d}

{\bf A}_{t} = [{\bf a}_{1},\dots,{\bf a}_{n}]^{T} \in {\mathbb{R}}^{n\times d}

{\bf H}_{t} = \frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t} + \gamma{\bf I}_{d} \in {\mathbb{R}}^{d\times d}

\Big(\tfrac{1}{s}\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t} + \gamma{\bf I}_{d}\Big)\,{\bf p} = {\bf g}_{t}

{\bf w}_{t+1} = {\bf w}_{t} - \alpha_{t}\tilde{\bf p}_{t}

s = \Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big)

\|\boldsymbol{\Delta}_{t}\|_{2} \leq \varepsilon^{t}\sqrt{\kappa}\,\|\boldsymbol{\Delta}_{0}\|_{2}

s = \Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big)

\|\boldsymbol{\Delta}_{t+1}\|_{2} \leq \varepsilon\,\sqrt{\kappa_{t}}\,\|\boldsymbol{\Delta}_{t}\|_{2} + \tfrac{L}{\sigma_{\min}({\bf H}_{t})}\,\|\boldsymbol{\Delta}_{t}\|_{2}^{2}

s = \Theta\Big(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big)

s = \Theta\Big(\tfrac{n\kappa_{t}}{\varepsilon^{2}(1-\varepsilon\kappa_{t})^{2}}\,\tfrac{\max_{i}\|{\bf a}_{i}\|_{2}^{2}}{\|{\bf A}\|_{2}^{2}}\,\log\tfrac{d}{\delta}\Big)

s = \tilde{\Theta}\Big(\tfrac{1}{\varepsilon^{2}(1-\varepsilon\kappa_{t})^{2}}\,\tfrac{\sigma_{\max}({\bf A}_{t}^{T}{\bf A}_{t})+n\gamma}{\sigma_{\max}({\bf A}_{t}^{T}{\bf A}_{t})}\,\sum_{i=1}^{d}\tfrac{\sigma_{i}({\bf A}_{t}^{T}{\bf A}_{t})}{\sigma_{\min}({\bf A}_{t}^{T}{\bf A}_{t})+n\gamma}\Big)

s = \Theta\Big(\tfrac{d}{\varepsilon^{2}}\,\log\tfrac{d}{\delta}\Big)

{\bf w}_{t+1} = {\bf w}_{t} - \alpha_{t}{\bf g}_{t}

\widetilde{\bf H}_{t,i} = \frac{1}{s}{\bf A}_{t,i}^{T}{\bf A}_{t,i} + \gamma{\bf I}_{d}

\widetilde{\bf p}_{t,i} = \widetilde{\bf H}_{t,i}^{-1}{\bf g}_{t}

\widetilde{\bf p}_{t} = \frac{1}{m}\sum_{i=1}^{m}\widetilde{\bf p}_{t,i}

{\bf w}_{t+1} = {\bf w}_{t} - \alpha_{t}\widetilde{\bf p}_{t}

s = \Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon}\log\tfrac{m{d_{\text{eff}}^{\gamma}}}{\delta}\Big)

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods

Full text

Xiang Li (1), Shusen Wang (2), Zhihua Zhang (1)

(1) School of Mathematical Sciences, Peking University, China

(2) Department of Computer Science, Stevens Institute of Technology, USA

Abstract

Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices while still using sufficient curvature information. However, previous results require $\Omega(d)$ samples to approximate Hessians, where $d$ is the dimension of data points, making them less practical for high-dimensional data. The situation deteriorates when $d$ is comparable to the number of data points $n$ , which requires taking the whole dataset into account and makes subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on convex empirical risk minimization with high dimensional data. Specifically, we provably need only $\widetilde{\Theta}({d_{\text{eff}}^{\gamma}})$ samples for the approximation of Hessian matrices, where ${d_{\text{eff}}^{\gamma}}$ is the $\gamma$ -ridge leverage and can be much smaller than $d$ as long as $n\gamma\gg 1$ . Additionally, we extend this result so that subsampled Newton methods can work for high-dimensional data on both distributed optimization problems and non-smooth regularized problems.

1 Introduction

Let ${\bf x}_{1},...,{\bf x}_{n}\in{\mathbb{R}}^{d}$ be the feature vectors and let $l_{i}(\cdot)$ be a convex, smooth, and twice differentiable loss function; the response $y_{i}$ is captured by $l_{i}$ . In this paper, we study the following optimization problem:

$$\min_{{\bf w}\in{\mathbb{R}}^{d}}\;G({\bf w})\,:=\,\frac{1}{n}\sum_{j=1}^{n}l_{j}({\bf x}_{j}^{T}{\bf w})+\frac{\gamma}{2}\|{\bf w}\|_{2}^{2}+r({\bf w}), \tag{1}$$

where $r(\cdot)$ is a non-smooth convex function. We first consider the simple case where $r$ is zero, i.e.,

$$\min_{{\bf w}\in{\mathbb{R}}^{d}}\;F({\bf w})\,:=\,\frac{1}{n}\sum_{j=1}^{n}l_{j}({\bf x}_{j}^{T}{\bf w})+\frac{\gamma}{2}\|{\bf w}\|_{2}^{2}. \tag{2}$$

Such a convex optimization problem (2) arises frequently in machine learning Shalev Shwartz and Ben David (2014). For example, in logistic regression, $l_{j}({\bf x}_{j}^{T}{\bf w})=\log(1+\exp(-y_{j}{\bf x}_{j}^{T}{\bf w}))$ , and in linear regression, $l_{j}({\bf x}_{j}^{T}{\bf w})=\frac{1}{2}({\bf x}_{j}^{T}{\bf w}-y_{j})^{2}$ . Then we consider the more general case where $r$ is non-zero, e.g., LASSO Tibshirani (1996) and elastic net Zou and Hastie (2005).

To solve (2), many first-order methods have been proposed. First-order methods solely exploit information in the objective function and its gradient. Accelerated gradient descent Golub and Van Loan (2012); Nesterov (2013); Bubeck (2014), stochastic gradient descent Robbins and Monro (1985), and their variants Lin et al. (2015); Johnson and Zhang (2013); Schmidt et al. are the most popular approaches in practice due to their simplicity and low per-iteration time complexity. As pointed out by Xu et al. (2017), the downsides of first-order methods are their slow convergence to high precision and their sensitivity to the condition number and hyper-parameters.

Second-order methods use not only the gradient but also information in the Hessian matrix in their update. In particular, Newton’s method, a canonical second-order method, has the following update rule:

$${\bf w}_{t+1}={\bf w}_{t}-\alpha_{t}{\bf H}_{t}^{-1}{\bf g}_{t},$$

where the gradient ${\bf g}_{t}=\nabla F({\bf w}_{t})$ is the first derivative of the objective function at ${\bf w}_{t}$ , the Hessian ${\bf H}_{t}=\nabla^{2}F({\bf w}_{t})$ is the second derivative at ${\bf w}_{t}$ , and $\alpha_{t}$ is the step size and can be safely set as one under certain conditions. Compared to first-order methods, Newton’s method requires fewer iterations, is more robust to the hyper-parameter setting, and has guaranteed super-linear local convergence to high precision. However, Newton’s method is slow in practice, as in each iteration many Hessian-vector products are required to solve the linear system ${\bf H}_{t}{\bf p}={\bf g}_{t}$ . Quasi-Newton methods use information from the history of updates to construct an approximate Hessian Dennis and Moré (1977). Well-known works include Broyden-Fletcher-Goldfarb-Shanno (BFGS) Wright and Nocedal (1999) and its limited memory version (L-BFGS) Liu and Nocedal (1989), but their convergence rates are not comparable to that of Newton’s method.

Recent works proposed the Sub-Sampled Newton (SSN) methods to reduce the per-iteration complexity of Newton’s method Byrd et al. (2011); Pilanci and Wainwright (2015); Roosta Khorasani and Mahoney (2016); Pilanci and Wainwright (2017); Xu et al. (2017); Berahas et al. (2017); Ye et al. (2017). For the particular problem (2), the Hessian matrix can be written in the form

$${\bf H}_{t}=\frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t}+\gamma{\bf I}_{d},$$

for some $n\times d$ matrix ${\bf A}_{t}$ whose $i$ -th row is a scaling of ${\bf x}_{i}$ . The basic idea of SSN is to sample and scale $s$ ( $s\ll n$ ) rows of ${\bf A}_{t}$ to form $\widetilde{\bf A}_{t}\in{\mathbb{R}}^{s\times d}$ and approximate ${\bf H}_{t}$ by

$$\widetilde{\bf H}_{t}=\frac{1}{s}\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t}+\gamma{\bf I}_{d}.$$

The quality of Hessian approximation is guaranteed by random matrix theories Tropp (2015); Woodruff (2014), based on which the convergence rate of SSN is established.

As second-order methods perform heavy computation in each iteration and converge in a small number of iterations, they have been adapted to distributed machine learning with the aim of reducing the communication cost Shamir et al. (2014); Mahajan et al. (2015); Zhang and Lin (2015); Reddi et al. (2016); Shusen Wang et al. (2018). In particular, the Globally Improved Approximate NewTon (GIANT) method is based on the same idea as SSN and has a fast convergence rate.

Like Newton’s method, SSN is not directly applicable to (1) because the objective function is non-smooth. Following the proximal-Newton method Lee et al. (2014), SSN has been adapted to solve convex optimization with non-smooth regularization Liu et al. (2017). SSN has also been applied to nonconvex problems Xu et al. (2017); Tripuraneni et al. (2017).

1.1 Our contributions

Recall that $n$ is the total number of samples, $d$ is the number of features, and $s$ is the size of the randomly sampled subset. (Obviously $s\ll n$ , otherwise the subsampling does not speed up computation.) The existing theories of SSN require $s$ to be at least $\Omega(d)$ . For the big-data setting, i.e., $d\ll n$ , the existing theories nicely guarantee the convergence of SSN.

However, high-dimensional data is not uncommon at all in machine learning; $d$ can be comparable to or even greater than $n$ . Thus requiring both $s\ll n$ and $s=\Omega(d)$ seriously limits the application of SSN. We consider the question:

Do SSN and its variants work for (1) when $s<d$ ?

The empirical studies in Xu et al. (2016, 2017); Shusen Wang et al. (2018) indicate that yes, SSN and its extensions have fast convergence even if $s$ is substantially smaller than $d$ . However, their empirical observations are not supported by theory.

This work bridges the gap between theory and practice for convex empirical risk minimization. We show it suffices to use a uniformly sampled subset of size $s=\tilde{\Theta}({d_{\text{eff}}^{\gamma}})$ to approximate the Hessian, where $\gamma$ is the regularization parameter, ${d_{\text{eff}}^{\gamma}}$ ( $\leq d$ ) is the $\gamma$ -effective-dimension of the $d\times d$ Hessian matrix, and $\tilde{\Theta}$ hides the constant and logarithmic factors. If $n\gamma$ is larger than most of the $d$ eigenvalues of the Hessian, then ${d_{\text{eff}}^{\gamma}}$ is tremendously smaller than $d$ Cohen et al. (2015). Our theory is applicable to three SSN methods.

• In Section 3, we study the convex and smooth problem (2). We show the convergence of the standard SSN with the effective-dimension dependence and improve Xu et al. (2016).

• In Section 4, for the same optimization problem (2), we extend the result to the distributed computing setting and improve the bound of GIANT Shusen Wang et al. (2018).

• In Section 5, we study a convex but nonsmooth problem (1) and analyze the combination of SSN and proximal-Newton.

In Section 6, we analyse SSN methods when the subproblems are inexactly solved. The proofs of the main theorems are in the appendix.

2 Notation and Preliminary

Basic matrix notation.

Let ${\bf I}_{n}$ be the $n\times n$ identity matrix. Let $\|{\bf a}\|_{2}$ denote the vector $\ell_{2}$ norm and $\|{\bf A}\|_{2}$ denote the matrix spectral norm. For a matrix ${\bf A}\in{\mathbb{R}}^{n\times d}$ , let

$${\bf A}={\bf U}\boldsymbol{\Sigma}{\bf V}^{T}=\sum_{i=1}^{d}\sigma_{i}{\bf u}_{i}{\bf v}_{i}^{T} \tag{5}$$

be its singular value decomposition (SVD), with $\sigma_{\max}({\bf A})$ its largest singular value and $\sigma_{\min}({\bf A})$ the smallest (the $d$ -th largest). The Moore-Penrose inverse of ${\bf A}$ is defined by ${\bf A}^{\dagger}={\bf V}\boldsymbol{\Sigma}^{-1}{\bf U}^{T}$ . If a symmetric real matrix has no negative eigenvalues, it is called symmetric positive semidefinite (SPSD). We denote ${\bf A}\preceq{\bf B}$ if ${\bf B}-{\bf A}$ is SPSD. For an SPD matrix ${\bf H}$ , we define a norm by $\|{\bf x}\|_{{\bf H}}=\sqrt{{\bf x}^{T}{\bf H}{\bf x}}$ and its condition number by $\kappa({\bf H})=\frac{\sigma_{\max}({\bf H})}{\sigma_{\min}({\bf H})}$ .

Ridge leverage scores.

For ${\bf A}=[{\bf a}_{1}^{T};\cdots;{\bf a}_{n}^{T}]\in{\mathbb{R}}^{n\times d}$ , its row $\gamma$ -ridge leverage score ( $\gamma\geq 0$ ) is defined by

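$$l_{j}^{\gamma}={\bf a}_{j}^{T}({\bf A}^{T}{\bf A}+n\gamma{\bf I}_{d})^{\dagger}{\bf a}_{j}=\sum_{k=1}^{d}\frac{\sigma_{k}^{2}}{\sigma_{k}^{2}+n\gamma}u_{jk}^{2},$$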

for $j\in[n]\triangleq\{1,2,...,n\}$ . Here $\sigma_{k}$ and ${\bf u}_{k}$ are defined in (5). For $\gamma=0$ , $l_{j}^{\gamma}$ is the standard leverage score used by Drineas et al. (2008); Michael W. Mahoney (2011).

Effective dimension.

The $\gamma$ -effective dimension of ${\bf A}\in{\mathbb{R}}^{n\times d}$ is defined by

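$${d_{\text{eff}}^{\gamma}}({\bf A})=\sum_{j=1}^{n}l_{j}^{\gamma}=\sum_{k=1}^{d}\frac{\sigma_{k}^{2}}{\sigma_{k}^{2}+n\gamma}\leq d.$$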

If $n\gamma$ is larger than most of the singular values of ${\bf A}^{T}{\bf A}$ , then ${d_{\text{eff}}^{\gamma}}({\bf A})$ is tremendously smaller than $d$ Alaoui and Mahoney (2015); Cohen et al. (2017). In fact, to trade-off the bias and variance, the optimal setting of $\gamma$ makes $n\gamma$ comparable to the top singular values of ${\bf A}^{T}{\bf A}$ Hsu et al. (2014); Wang et al. (2018), and thus ${d_{\text{eff}}^{\gamma}}({\bf A})$ is small in practice.
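As a rough numerical illustration of the two definitions above, the following NumPy sketch computes the row ridge leverage scores from the pseudo-inverse formula and the effective dimension from the singular values, then compares the two. The function names, the synthetic matrix with geometrically decaying column scales, and the value of `gamma` are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def ridge_leverage_scores(A, gamma):
    """Row scores l_j = a_j^T (A^T A + n*gamma*I)^+ a_j."""
    n, d = A.shape
    K = A.T @ A + n * gamma * np.eye(d)
    # einsum computes a_j^T K^+ a_j for every row j at once
    return np.einsum("ij,jk,ik->i", A, np.linalg.pinv(K), A)

def effective_dimension(A, gamma):
    """d_eff = sum_k sigma_k^2 / (sigma_k^2 + n*gamma); assumes gamma > 0."""
    n = A.shape[0]
    sigma2 = np.linalg.svd(A, compute_uv=False) ** 2
    return float(np.sum(sigma2 / (sigma2 + n * gamma)))

rng = np.random.default_rng(0)
n, d, gamma = 200, 50, 1e-2
# Hypothetical data: column k is scaled by 0.7**k, so the spectrum decays fast.
A = rng.standard_normal((n, d)) * (0.7 ** np.arange(d))

scores = ridge_leverage_scores(A, gamma)
d_eff = effective_dimension(A, gamma)
print(f"d = {d}, sum of scores = {scores.sum():.4f}, d_eff = {d_eff:.4f}")
```

With a fast-decaying spectrum like this one, most terms $\sigma_{k}^{2}/(\sigma_{k}^{2}+n\gamma)$ are close to zero, which is why the effective dimension ends up far below $d$ .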

Ridge coherence.

The row $\gamma$ -ridge coherence of ${\bf A}\in{\mathbb{R}}^{n\times d}$ is

$$\mu^{\gamma}=\frac{n}{{d_{\text{eff}}^{\gamma}}}\max_{i\in[n]}l_{i}^{\gamma},$$

which measures the extent to which the information in the rows concentrates. If ${\bf A}$ has most of its mass in a relatively small number of rows, its $\gamma$ -ridge coherence could be high. This concept is necessary for matrix approximation via uniform sampling. It could be imagined that if most information is in a few rows, which means high coherence, then uniform sampling is likely to miss some of the important rows, leading to low approximation quality. When $\gamma=0$ , it coincides with the standard row coherence

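$$\mu^{0}=\frac{n}{d}\max_{j\in[n]}l_{j}^{0}=\frac{n}{d}\max_{j\in[n]}{\bf a}_{j}^{T}({\bf A}^{T}{\bf A})^{\dagger}{\bf a}_{j},$$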

which is widely used to analyze techniques such as compressed sensing Candes et al. (2006), matrix completion Candès and Recht (2009), robust PCA Candès et al. (2011) and so on.
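The effect of concentrated rows on the ridge coherence can be seen in a small experiment. The sketch below is illustrative only: `ridge_coherence` is a hypothetical helper, and the two test matrices (an i.i.d. Gaussian matrix and the same matrix with one row scaled up by 50) are assumptions chosen to contrast a low-coherence case with a high-coherence one.

```python
import numpy as np

def ridge_coherence(A, gamma):
    """mu = (n / d_eff) * max_i l_i, using d_eff = sum_i l_i; assumes gamma > 0."""
    n, d = A.shape
    K = A.T @ A + n * gamma * np.eye(d)
    # scores[i] = a_i^T K^{-1} a_i
    scores = np.einsum("ij,ji->i", A, np.linalg.solve(K, A.T))
    return n * scores.max() / scores.sum()

rng = np.random.default_rng(0)
n, d, gamma = 500, 20, 1e-2
A_flat = rng.standard_normal((n, d))   # information spread over all rows
A_spiky = A_flat.copy()
A_spiky[0] *= 50.0                     # one row carries most of the mass

mu_flat = ridge_coherence(A_flat, gamma)
mu_spiky = ridge_coherence(A_spiky, gamma)
print(f"coherence (flat) = {mu_flat:.2f}, coherence (spiky) = {mu_spiky:.2f}")
```

Since the maximum score is never below the average score, the coherence is always at least one; the spiky matrix pushes it well above that, which is the regime where uniform sampling is likely to miss the important row.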

Gradient and Hessian.

For the optimization problem (2), the gradient of $F(\cdot)$ at ${\bf w}_{t}$ is

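$${\bf g}_{t}=\frac{1}{n}\sum_{j=1}^{n}l_{j}^{\prime}({\bf x}_{j}^{T}{\bf w}_{t})\cdot{\bf x}_{j}+\gamma{\bf w}_{t}\in{\mathbb{R}}^{d}.$$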

The Hessian matrix at ${\bf w}_{t}$ is

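$${\bf H}_{t}=\frac{1}{n}\sum_{j=1}^{n}l_{j}^{\prime\prime}({\bf x}_{j}^{T}{\bf w}_{t})\cdot{\bf x}_{j}{\bf x}_{j}^{T}+\gamma{\bf I}_{d}\in{\mathbb{R}}^{d\times d}.$$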

Let ${\bf a}_{j}=\sqrt{l_{j}^{\prime\prime}({\bf x}_{j}^{T}{\bf w}_{t})}\cdot{\bf x}_{j}\in{\mathbb{R}}^{d}$ and

$${\bf A}_{t}=[{\bf a}_{1},\dots,{\bf a}_{n}]^{T}\in{\mathbb{R}}^{n\times d}. \tag{9}$$

In this way, the Hessian matrix can be expressed as

$${\bf H}_{t}=\frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t}+\gamma{\bf I}_{d}\in{\mathbb{R}}^{d\times d}.$$
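For the logistic loss from the introduction, the gradient, the scaled matrix ${\bf A}_{t}$ , and the Hessian can be formed as in the sketch below. It assumes labels $y_{j}\in\{-1,+1\}$ and uses hypothetical function names and random data; the derivative formulas $l_{j}^{\prime}(z)=-y_{j}\,\sigma(-y_{j}z)$ and $l_{j}^{\prime\prime}(z)=\sigma(-y_{j}z)(1-\sigma(-y_{j}z))$ follow from differentiating $\log(1+\exp(-y_{j}z))$ .

```python
import numpy as np

def objective(X, y, w, gamma):
    """F(w) for the logistic loss with labels y_j in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w))) + 0.5 * gamma * (w @ w)

def grad_and_hessian(X, y, w, gamma):
    n, d = X.shape
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))  # sigmoid(-y_j x_j^T w)
    l1 = -y * p                            # first derivatives l_j'
    l2 = p * (1.0 - p)                     # second derivatives l_j''
    g = X.T @ l1 / n + gamma * w
    A = np.sqrt(l2)[:, None] * X           # rows a_j = sqrt(l_j'') * x_j
    H = A.T @ A / n + gamma * np.eye(d)
    return g, H, A

rng = np.random.default_rng(0)
n, d, gamma = 300, 8, 0.1
X = rng.standard_normal((n, d))            # hypothetical features
y = rng.choice([-1.0, 1.0], size=n)        # hypothetical labels
w = rng.standard_normal(d)

g, H, A = grad_and_hessian(X, y, w, gamma)
print("gradient norm:", np.linalg.norm(g))
print("smallest Hessian eigenvalue:", np.linalg.eigvalsh(H).min())
```

Because ${\bf A}_{t}^{T}{\bf A}_{t}$ is positive semidefinite, every eigenvalue of the Hessian is at least $\gamma$ , which the last printed line reflects.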

3 Sub-Sampled Newton (SSN)

In this section, we provide new and stronger convergence guarantees for the SSN methods. For SSN with uniform sampling, we require a subsample size of $s=\tilde{\Theta}(\mu^{\gamma}{d_{\text{eff}}^{\gamma}})$ ; for SSN with ridge leverage score sampling, a smaller sample size, $s=\tilde{\Theta}({d_{\text{eff}}^{\gamma}})$ , suffices. (We do not describe the ridge leverage score sampling in detail; the readers can refer to Alaoui and Mahoney (2015); Cohen et al. (2015).) Because ${d_{\text{eff}}^{\gamma}}$ is typically much smaller than $d$ , our new results guarantee convergence when $s<d$ .

3.1 Algorithm description

We set an integer $s$ ( $\ll n$ ) and uniformly sample $s$ items out of $[n]$ to form the subset ${\mathcal{S}}$ . In the $t$ -th iteration, we compute the full gradient ${\bf g}_{t}$ and form the matrix $\tilde{{\bf A}}_{t}\in{\mathbb{R}}^{s\times d}$ , which contains the rows of ${\bf A}_{t}\in{\mathbb{R}}^{n\times d}$ indexed by ${\mathcal{S}}$ . Then, the approximate Newton direction $\tilde{{\bf p}}_{t}$ is computed by solving the linear system

$$\Big(\tfrac{1}{s}\tilde{{\bf A}}_{t}^{T}\tilde{{\bf A}}_{t}+\gamma{\bf I}_{d}\Big)\,{\bf p}={\bf g}_{t} \tag{11}$$

by either matrix inversion or the conjugate gradient method. Finally, ${\bf w}$ is updated by

$${\bf w}_{t+1}={\bf w}_{t}-\alpha_{t}\tilde{{\bf p}}_{t},$$

where $\alpha_{t}$ can be set to one or found by line search. In the rest of this section, we only consider $\alpha_{t}=1$ .

Most of the computation is performed in solving (11). The only difference between the standard Newton and the SSN methods is replacing ${\bf A}_{t}\in{\mathbb{R}}^{n\times d}$ by $\tilde{{\bf A}}_{t}\in{\mathbb{R}}^{s\times d}$ . Compared to Newton’s method, SSN leads to an almost $\frac{n}{s}$ -factor speed up of the per-iteration computation; however, SSN requires more iterations to converge. Nevertheless, to reach a fixed precision, the overall cost of SSN is much lower than Newton’s method.
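The procedure of this subsection can be sketched for the quadratic loss, where $l_{j}^{\prime\prime}=1$ and therefore ${\bf A}_{t}={\bf X}$ for every $t$ . The code below is a minimal illustration, not the authors' implementation: the function name, the synthetic data with geometrically decaying column scales, and the choices of `s`, `gamma`, and the iteration count are all assumptions. The subset is drawn once, the step size is fixed to one, and the subsample is deliberately smaller than the dimension.

```python
import numpy as np

def ssn_ridge(X, y, gamma, s, iters, seed=0):
    """Sub-sampled Newton for ridge regression with uniform sampling."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = rng.choice(n, size=s, replace=False)         # uniform subset of [n]
    H_sub = X[S].T @ X[S] / s + gamma * np.eye(d)    # sub-sampled Hessian
    w = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (X @ w - y) / n + gamma * w        # full gradient
        p = np.linalg.solve(H_sub, g)                # approximate Newton direction
        w = w - p                                    # step size alpha_t = 1
    return w

rng = np.random.default_rng(0)
n, d, s, gamma = 4000, 300, 250, 1e-2                # note s < d
# Hypothetical data with a fast-decaying spectrum, so d_eff is far below d.
X = rng.standard_normal((n, d)) * (0.5 ** np.arange(d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w_star = np.linalg.solve(X.T @ X / n + gamma * np.eye(d), X.T @ y / n)
w_ssn = ssn_ridge(X, y, gamma, s, iters=60)
rel_err = np.linalg.norm(w_ssn - w_star) / np.linalg.norm(w_star)
print(f"relative error after 60 SSN iterations: {rel_err:.2e}")
```

Each iteration solves a system built from only `s` rows, while the exact minimizer `w_star` is computed from all `n` rows purely for comparison.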

3.2 Our improved convergence bounds

Improved bound for quadratic loss.

We let ${\bf w}^{\star}$ be the unique (due to the strong convexity) optimal solution to problem (2), ${\bf w}_{t}$ be the intermediate output of the $t$ -th iteration, and $\boldsymbol{\Delta}_{t}={\bf w}_{t}-{\bf w}^{\star}$ . If the loss function of (1) is quadratic, e.g., $l_{j}({\bf x}_{j}^{T}{\bf w})=\frac{1}{2}({\bf x}_{j}^{T}{\bf w}-y_{j})^{2}$ , the Hessian matrix ${\bf H}_{t}=\frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t}+\gamma{\bf I}_{d}$ does not change with the iteration, so we use ${\bf H}$ and ${\bf A}$ instead. Theorem 1 guarantees the global convergence of SSN.

Theorem 1 (Global Convergence).

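Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -effective dimension and $\gamma$ -ridge coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Assume the loss function of (1) is quadratic. For a sufficiently large sub-sample size:

$$s=\Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big),$$

with probability at least $1-\delta$ ,

$$\big\|\boldsymbol{\Delta}_{t}\big\|_{2}\leq\varepsilon^{t}\sqrt{\kappa}\,\big\|\boldsymbol{\Delta}_{0}\big\|_{2}. \tag{12}$$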

Proof.

We prove the theorem in Appendix B.2. ∎
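To make the linear rate in Theorem 1 concrete, the bound can be inverted to give a sufficient number of iterations for a target relative accuracy. The tolerance $\tau$ and the numerical values below are illustrative assumptions, not quantities from the paper.

```latex
\varepsilon^{t}\sqrt{\kappa}\,\|\boldsymbol{\Delta}_{0}\|_{2} \;\leq\; \tau\,\|\boldsymbol{\Delta}_{0}\|_{2}
\quad\Longleftrightarrow\quad
t \;\geq\; \frac{\log(\sqrt{\kappa}/\tau)}{\log(1/\varepsilon)} .

% Example with assumed values:
\varepsilon = 0.1,\quad \kappa = 10^{4},\quad \tau = 10^{-6}
\quad\Longrightarrow\quad
t \;\geq\; \frac{\log(10^{2}/10^{-6})}{\log 10} \;=\; 8 .
```

The condition number enters only through the logarithm, so a poorly conditioned Hessian adds a few iterations instead of multiplying their number.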

Improved bound for non-quadratic loss.

If the loss function of (1) is non-quadratic, the Hessian matrix ${\bf H}_{t}$ changes with iteration, and we can only guarantee fast local convergence, as in the prior works Roosta Khorasani and Mahoney (2016); Xu et al. (2016). We make a standard assumption on the Hessian matrix, which is required by all the prior works on Newton-type methods.

Assumption 1.

The Hessian matrix $\nabla^{2}F({\bf w})$ is $L$ -Lipschitz continuous, i.e., $\|\nabla^{2}F({\bf w})-\nabla^{2}F({\bf w}^{\prime})\|_{2}\leq L\|{\bf w}-{\bf w}^{\prime}\|_{2}$ , for arbitrary ${\bf w}$ and ${\bf w}^{\prime}$ .

Theorem 2 (Local Convergence).

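Let ${d_{\text{eff}}^{\gamma}},\mu^{\gamma}$ respectively be the $\gamma$ -effective dimension and $\gamma$ -ridge coherence of ${\bf A}_{t}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size:

$$s=\Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big),$$

with probability at least $1-\delta$ ,

$$\big\|\boldsymbol{\Delta}_{t+1}\big\|_{2}\leq\varepsilon\,\sqrt{\kappa_{t}}\,\big\|\boldsymbol{\Delta}_{t}\big\|_{2}+\tfrac{L}{\sigma_{\min}({\bf H}_{t})}\,\big\|\boldsymbol{\Delta}_{t}\big\|_{2}^{2}, \tag{13}$$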

where $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ is the condition number.

Proof.

We prove the theorem in Appendix B.3. ∎

Theorem 3.

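If ridge leverage score sampling is used instead, the sample complexity in Theorems 1 and 2 will be improved to

$$s=\Theta\Big(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\Big).$$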

Remark 1.

Ridge leverage score sampling eliminates the dependence on the coherence, and the bound is stronger than all the existing sample complexities for SSN. We prove Theorem 3 in Appendix B.4. However, the ridge leverage score sampling is expensive and impractical and thus is of theoretical interest only.
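Since the paper defers the details of ridge leverage score sampling to its references, the sketch below uses the standard importance-sampling construction as an assumption: rows are drawn with replacement with probability proportional to their ridge leverage scores and rescaled so that the sub-sampled Gram matrix is unbiased. The function name, the test matrix with one dominant row, and the sizes are hypothetical. Note that the exact scores are computed from the full matrix here, which is the expensive step mentioned in Remark 1.

```python
import numpy as np

def ridge_leverage_sample(A, gamma, s, rng):
    """Sample s rows proportionally to ridge leverage scores and rescale them."""
    n, d = A.shape
    K = A.T @ A + n * gamma * np.eye(d)
    scores = np.einsum("ij,ji->i", A, np.linalg.solve(K, A.T))
    probs = scores / scores.sum()
    idx = rng.choice(n, size=s, replace=True, p=probs)
    # Rescaling makes (1/s) A_tilde^T A_tilde an unbiased estimate of (1/n) A^T A.
    return A[idx] / np.sqrt(n * probs[idx])[:, None]

rng = np.random.default_rng(0)
n, d, s, gamma = 2000, 10, 400, 1e-2
A = rng.standard_normal((n, d))
A[0] *= 50.0                                   # one dominant row (high coherence)

A_tilde = ridge_leverage_sample(A, gamma, s, rng)
H = A.T @ A / n + gamma * np.eye(d)
H_tilde = A_tilde.T @ A_tilde / s + gamma * np.eye(d)

# Relative spectral error || H^{-1/2} (H_tilde - H) H^{-1/2} ||_2
evals, evecs = np.linalg.eigh(H)
H_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
err = np.linalg.norm(H_inv_sqrt @ (H_tilde - H) @ H_inv_sqrt, 2)
print(f"relative spectral error: {err:.3f}")
```

The dominant row receives a score close to one and is therefore sampled often but down-weighted, which is how this scheme avoids the dependence on coherence that uniform sampling incurs.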

Although Newton-type methods empirically demonstrate fast global convergence in almost all real-world applications, they do not have strong global convergence guarantees. A weak global convergence bound for SSN was established by Roosta Khorasani and Mahoney (2016). We do not further discuss the global convergence issue in this paper.

3.3 Comparison with prior work

For SSN with uniform sampling, the prior work Roosta Khorasani and Mahoney (2016) showed that to obtain the same convergence bounds as ours, (12) and (13), the sample complexity should be

$$s=\Theta\Big(\tfrac{n\kappa_{t}}{\varepsilon^{2}(1-\varepsilon\kappa_{t})^{2}}\,\tfrac{\max_{i}\|{\bf a}_{i}\|_{2}^{2}}{\|{\bf A}\|_{2}^{2}}\,\log\tfrac{d}{\delta}\Big).$$

In comparison, to obtain the same convergence rate, our sample complexity has a better dependence on the condition number and the dimensionality.

For the row norm square sampling of Xu et al. (2016), which is slightly more expensive than uniform sampling, a sample complexity of

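$$s=\tilde{\Theta}\Big(\tfrac{1}{\varepsilon^{2}(1-\varepsilon\kappa_{t})^{2}}\,\tfrac{\sigma_{\max}({\bf A}_{t}^{T}{\bf A}_{t})+n\gamma}{\sigma_{\max}({\bf A}_{t}^{T}{\bf A}_{t})}\,\sum_{i=1}^{d}\tfrac{\sigma_{i}({\bf A}_{t}^{T}{\bf A}_{t})}{\sigma_{\min}({\bf A}_{t}^{T}{\bf A}_{t})+n\gamma}\Big)$$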

suffices for the same convergence rates as ours, (12) and (13). Their bound may or may not guarantee convergence for $s<d$ . Even if $n\gamma$ is larger than most of the singular values of ${\bf A}_{t}^{T}{\bf A}_{t}$ , their required sample complexity can be large.

For leverage score sampling, Xu et al. (2016) showed that to obtain the same convergence bounds as ours, (12) and (13), the sample complexity should be

$$s=\Theta\Big(\tfrac{d}{\varepsilon^{2}}\,\log\tfrac{d}{\delta}\Big),$$

which depends on $d$ (worse than our dependence on ${d_{\text{eff}}^{\gamma}}$ ) but does not depend on coherence. We show that if the ridge leverage score sampling is used, then $s=\Theta\big(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\big)$ samples suffice, which is better than the above sample complexity. However, because approximately computing the (ridge) leverage scores is expensive, neither the leverage score sampling of Xu et al. (2016) nor the ridge leverage score sampling proposed by us is a practical choice.

4 Distributed Newton-Type Method

Communication-efficient distributed optimization is an important research field, and second-order methods have been developed to reduce the communication cost, e.g., DANE Shamir et al. (2014), AIDE Reddi et al. (2016), DiSCO Zhang and Lin (2015) and GIANT Shusen Wang et al. (2018). Among them, GIANT has the strongest convergence bound. In this section, we further improve the convergence analysis of GIANT and show that GIANT does converge when the local sample size, $s=\frac{n}{m}$ , is smaller than the number of features, $d$ .

4.1 Motivation and algorithm description

Assume the $n$ samples are partitioned among $m$ worker machines uniformly at random. Each worker machine has its own processors and memory, and the worker machines can communicate by message passing. Communication is costly compared to local computation; when the number of worker machines is large, communication is oftentimes the bottleneck of distributed computing. Thus there is a strong desire to reduce the communication cost of distributed computing. Our goal is to solve the optimization problem (2) in a communication-efficient way.

The first-order methods are computation-efficient but not communication-efficient. Let us take gradient descent for example. In each iteration, with the iterate ${\bf w}_{t}$ at hand, the $i$ -th worker machine uses its local data to compute a local gradient ${\bf g}_{t,i}$ ; then the driver machine averages the local gradients to form the exact gradient ${\bf g}_{t}$ and updates the model by

$${\bf w}_{t+1}={\bf w}_{t}-\alpha_{t}{\bf g}_{t},$$

where $\alpha_{t}$ is the step size. Although each iteration is computationally efficient, the first-order methods (even with acceleration) take many iterations to converge, especially when the condition number is big. As each iteration requires broadcasting ${\bf w}_{t}$ and an aggregation of the local gradients to form ${\bf g}_{t}$ , the total number and complexity of communication are big.

Many second-order methods have been developed to improve the communication-efficiency, among which the Globally Improved Approximate NewTon (GIANT) method Shusen Wang et al. (2018) has the strongest convergence rates. Let $s=\frac{n}{m}$ be the local sample size and ${\bf A}_{t,i}\in{\mathbb{R}}^{s\times d}$ be the block of ${\bf A}_{t}\in{\mathbb{R}}^{n\times d}$ , previously defined in (9), held by the $i$ -th worker machine. With the iterate ${\bf w}_{t}$ at hand, the $i$ -th worker machine can use its local data samples to form the local Hessian matrix

$$\widetilde{\bf H}_{t,i}=\frac{1}{s}{\bf A}_{t,i}^{T}{\bf A}_{t,i}+\gamma{\bf I}_{d}$$

and output the local Approximate NewTon (ANT) direction

$$\widetilde{\bf p}_{t,i}=\widetilde{\bf H}_{t,i}^{-1}{\bf g}_{t}.$$

Finally, the driver machine averages the ANT directions

$$\widetilde{\bf p}_{t}=\frac{1}{m}\sum_{i=1}^{m}\widetilde{\bf p}_{t,i}$$

and performs the update

$${\bf w}_{t+1}={\bf w}_{t}-\alpha_{t}\widetilde{\bf p}_{t},$$

where the step size $\alpha_{t}$ can be set to one under certain conditions; we only consider the $\alpha_{t}=1$ case in the rest of this section.

GIANT is much more communication-efficient than the first-order methods. With $\alpha_{t}$ fixed, each iteration of GIANT has four rounds of communications: (1) broadcasting ${\bf w}_{t}$ , (2) aggregating the local gradients to form ${\bf g}_{t}$ , (3) broadcasting ${\bf g}_{t}$ , and (4) aggregating the ANT directions to form $\widetilde{\bf p}_{t}$ ; thus the per-iteration communication cost is just twice as much as a first-order method. Shusen Wang et al. (2018) showed that GIANT requires a much smaller number of iterations than the accelerated gradient method which has the optimal iteration complexity (without using second-order information).
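The four communication rounds can be mimicked on a single machine. The sketch below simulates GIANT for the quadratic loss with step size one; it is an illustration under assumptions (hypothetical function name, synthetic data with a fast-decaying spectrum, $m=20$ simulated workers) and makes no attempt at real message passing. The local sample size $n/m$ is deliberately smaller than $d$ .

```python
import numpy as np

def giant_ridge(X, y, gamma, m, iters, seed=0):
    """Single-process simulation of GIANT for ridge regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Partition the n samples among m workers uniformly at random.
    parts = np.array_split(rng.permutation(n), m)
    # For the quadratic loss the local Hessians do not change with t.
    local_H = [X[i].T @ X[i] / len(i) + gamma * np.eye(d) for i in parts]
    w = np.zeros(d)
    for _ in range(iters):
        # Rounds 1-2: broadcast w, aggregate local gradients into g.
        g = sum(X[i].T @ (X[i] @ w - y[i]) for i in parts) / n + gamma * w
        # Rounds 3-4: broadcast g, average the local ANT directions.
        p = np.mean([np.linalg.solve(H_i, g) for H_i in local_H], axis=0)
        w = w - p
    return w

rng = np.random.default_rng(0)
n, d, m, gamma = 4000, 300, 20, 1e-2           # local size n/m = 200 < d
X = rng.standard_normal((n, d)) * (0.5 ** np.arange(d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w_star = np.linalg.solve(X.T @ X / n + gamma * np.eye(d), X.T @ y / n)
w_giant = giant_ridge(X, y, gamma, m, iters=40)
rel_err = np.linalg.norm(w_giant - w_star) / np.linalg.norm(w_star)
print(f"relative error after 40 GIANT iterations: {rel_err:.2e}")
```

Only $d$ -dimensional vectors (the iterate, the gradients, and the ANT directions) would cross the network in a real deployment; the $d\times d$ local Hessians never leave the workers.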

4.2 Our improved convergence bounds

We analyze the GIANT method and improve the convergence analysis of Shusen Wang et al. (2018), which was the strongest theory in terms of communication efficiency. Throughout this section, we assume the $n$ samples are partitioned among $m$ worker machines uniformly at random.

Improved bound for quadratic loss.

We let ${\bf w}^{\star}$ be the unique optimal solution to problem (2) and $\boldsymbol{\Delta}_{t}={\bf w}_{t}-{\bf w}^{\star}$ . If the loss function of (1) is quadratic, e.g., $l_{i}({\bf x}_{i}^{T}{\bf w})=\frac{1}{2}({\bf x}_{i}^{T}{\bf w}-y_{i})^{2}$ , the Hessian matrix ${\bf H}_{t}=\frac{1}{n}{\bf A}_{t}^{T}{\bf A}_{t}+\gamma{\bf I}_{d}$ does not change with the iteration, so we use ${\bf H}$ and ${\bf A}$ instead. Theorem 4 guarantees the global convergence of GIANT.

Theorem 4 (Global Convergence).

Let ${d_{\text{eff}}^{\gamma}},\mu^{\gamma}$ respectively be the $\gamma$ -effective dimension and $\gamma$ -ridge coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Assume the loss function of (1) is quadratic. For a sufficiently large sub-sample size:

$$s=\Theta\Big(\tfrac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon}\log\tfrac{m{d_{\text{eff}}^{\gamma}}}{\delta}\Big),$$

with probability at least $1-\delta$ ,

[TABLE]

Proof.

We prove the theorem in Appendix C.2. ∎

Improved bound for non-quadratic loss.

If the loss function of (1) is non-quadratic, we can only guarantee fast local convergence under Assumption 1, as in the prior work Shusen Wang et al. (2018).

Theorem 5 (Local Convergence).

Let ${d_{\text{eff}}^{\gamma}},\mu^{\gamma}$ respectively be the $\gamma$ -effective dimension and $\gamma$ -ridge coherence of ${\bf A}_{t}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size:

[TABLE]

with probability at least $1-\delta$ ,

[TABLE]

where $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ is the condition number.

Proof.

We prove the theorem in Appendix C.3. ∎

Remark 2.

GIANT is a variant of SSN: SSN uses one of $\{\widetilde{\bf p}_{t,i}\}_{i=1}^{m}$ as the descent direction, whereas GIANT uses the average of the $m$ directions. As a benefit of the averaging, the sample complexity is improved from $s=\tilde{\Theta}\big(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\big)$ to $s=\tilde{\Theta}\big(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon}\big)$ .

4.3 Comparison with prior work

To guarantee the same convergence bounds, (15) and (16), Shusen Wang et al. require a sample complexity of $s=\Theta(\tfrac{\mu^{0}d}{\varepsilon}\log\tfrac{d}{\delta})$ . (The sample complexity in Shusen Wang et al. (2018) is actually slightly worse, but it is almost trivial to improve their result to what we show here.) This requires the local sample size $s=\frac{n}{m}$ to be greater than $d$ , even if the coherence $\mu^{0}$ is small. As communication and synchronization costs grow with $m$ , the communication-efficient method, GIANT, is most useful for the large $m$ setting; in this case, the requirement $n>md$ is unlikely to be satisfied.

In contrast, our improved bounds do not require $n>md$ . As ${d_{\text{eff}}^{\gamma}}$ can be tremendously smaller than $d$ , our requirement can be satisfied even if $m$ and $d$ are both large. Our bounds match the empirical observation of Shusen Wang et al. (2018): GIANT converges rapidly even if $md$ is larger than $n$ .

5 Sub-Sampled Proximal Newton (SSPN)

In the previous sections, we analyze second-order methods for the optimization problem (1) which has a smooth objective function. In this section, we study a harder problem:

[TABLE]

where $r$ is a non-smooth function. The standard Newton’s method does not apply because the second derivative of the objective function does not exist. Proximal Newton Lee et al. (2014), a second-order method, was developed to solve the problem, and later on, sub-sampling was incorporated to speed up computation Liu et al. (2017). We further improve the bounds of Sub-Sampled Proximal Newton (SSPN).

5.1 Algorithm Description

Let $F({\bf w})=\frac{1}{n}\sum_{j=1}^{n}l_{j}({\bf x}_{j}^{T}{\bf w})+\frac{\gamma}{2}\|{\bf w}\|_{2}^{2}$ be the smooth part of the objective function, and ${\bf g}_{t}$ and ${\bf H}_{t}$ be its first and second derivatives at ${\bf w}_{t}$ . The proximal Newton method Lee et al. (2014) iteratively solves the problem:

[TABLE]

and then performs the update ${\bf w}_{t+1}={\bf w}_{t}-{\bf p}_{t}$ . The right-hand side of the problem is a local quadratic approximation to $F({\bf w})$ at ${\bf w}_{t}$ . If $r(\cdot)=0$ , then proximal Newton is the same as the standard Newton’s method.

The sub-sampled proximal Newton (SSPN) method uses sub-sampling to approximate ${\bf H}_{t}$ ; let the approximate Hessian matrix be $\widetilde{{\bf H}}_{t}$ , as previously defined in (4). SSPN computes the descent direction by solving the local quadratic approximation:

[TABLE]

and then performs the update ${\bf w}_{t+1}={\bf w}_{t}-\widetilde{\bf p}_{t}$ .
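One SSPN step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes $r({\bf w})=\lambda\|{\bf w}\|_{1}$, writes the subproblem directly in terms of the new iterate ${\bf z}={\bf w}_{t}-{\bf p}$, and solves it with plain proximal gradient (ISTA). The function names and the choice of inner solver are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_newton_step(w, g, H_tilde, lam, n_iter=2000):
    """Approximately minimize over z
         g^T (z - w) + 0.5 (z - w)^T H_tilde (z - w) + lam * ||z||_1
    by proximal gradient (ISTA) and return the new iterate z."""
    step = 1.0 / np.linalg.eigvalsh(H_tilde)[-1]   # 1 / largest eigenvalue
    z = w.copy()
    for _ in range(n_iter):
        grad = g + H_tilde @ (z - w)               # gradient of the quadratic model
        z = soft_threshold(z - step * grad, step * lam)
    return z

rng = np.random.default_rng(0)
n, d, s, gamma = 500, 10, 100, 0.1
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = np.zeros(d)
g = A.T @ (A @ w - y) / n + gamma * w              # gradient of the smooth part F
rows = rng.integers(0, n, size=s)                  # uniform sub-sample of the rows
H_tilde = A[rows].T @ A[rows] / s + gamma * np.eye(d)

w_l1 = prox_newton_step(w, g, H_tilde, lam=0.05)     # non-smooth case
w_newton = prox_newton_step(w, g, H_tilde, lam=0.0)  # r = 0: plain Newton step
```

With `lam=0.0` the step coincides with the sub-sampled Newton update ${\bf w}_{t}-\widetilde{\bf H}_{t}^{-1}{\bf g}_{t}$, which matches the remark that proximal Newton reduces to Newton's method when $r(\cdot)=0$.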

5.2 Our improved error convergence bounds

We show that SSPN has exactly the same iteration complexity as SSN, for either quadratic or non-quadratic function $l_{j}(\cdot)$ . Nevertheless, the overall time complexity of SSPN is higher than SSN, as the subproblem (17) is expensive to solve if $r(\cdot)$ is non-smooth.

Theorem 6.

Theorems 1, 2, and 3 hold for SSPN.

Proof.

We prove the theorem in Appendix D.3 and D.4. ∎

5.3 Comparison with prior work

Liu et al. (2017) showed that when $\|\mbox{\boldmath$ \Delta $\unboldmath}_{t}\|_{2}$ is small enough, $\|\mbox{\boldmath$ \Delta $\unboldmath}_{t+1}\|_{2}$ will converge to zero linear-quadratically, similar to our results. But their sample complexity is

[TABLE]

This requires the sample size to be greater than $d$ . Because $\ell_{1}$ regularization is often used for high-dimensional data, the requirement that $d<s\ll n$ is too restrictive.

Our improved bounds show that $s=\tilde{\Theta}(\tfrac{{d_{\text{eff}}^{\gamma}}\mu^{\gamma}}{\varepsilon^{2}})$ suffices for uniform sampling and that $s=\tilde{\Theta}(\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}})$ suffices for ridge leverage score sampling. Since ${d_{\text{eff}}^{\gamma}}$ can be tremendously smaller than $d$ when $n\gamma\gg 1$ , our bounds are useful for high-dimensional data.

6 Inexactly Solving the Subproblems

Each iteration of SSN (Section 3) and GIANT (Section 4) involves solving a subproblem in the form

[TABLE]

Exactly solving this problem would perform the multiplication $\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t}$ and decompose the $d\times d$ approximate Hessian matrix $\tfrac{1}{s}\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t}+\gamma{\bf I}_{d}$ ; the time complexity is ${\mathcal{O}}(sd^{2}+d^{3})$ . In practice, it can be approximately solved by the conjugate gradient (CG) method, each iteration of which applies a vector to $\widetilde{\bf A}_{t}$ and $\widetilde{\bf A}_{t}^{T}$ ; the time complexity is ${\mathcal{O}}(q\cdot\mathsf{nnz}({\bf A}))$ , where $q$ is the number of CG iterations and $\mathsf{nnz}$ is the number of nonzeros. The inexact solution is particularly appealing if the data are sparse. In the following, we analyze the effect of the inexact solution of the subproblem.
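The matrix-free CG solve described above can be sketched as follows. This is a minimal illustration, assuming the approximate Hessian has the form $\tfrac{1}{s}\widetilde{\bf A}_{t}^{T}\widetilde{\bf A}_{t}+\gamma{\bf I}_{d}$; the function name `cg_subproblem` and the relative-residual stopping rule are assumptions made for the example, not the paper's conditions (18) and (19).

```python
import numpy as np

def cg_subproblem(A_sub, g, gamma, q=50, tol=1e-10):
    """Solve (A_sub^T A_sub / s + gamma I) p = g by conjugate gradient
    without ever forming the d x d matrix. Each iteration multiplies one
    vector by A_sub and one by A_sub^T."""
    s = A_sub.shape[0]
    matvec = lambda v: A_sub.T @ (A_sub @ v) / s + gamma * v
    p = np.zeros_like(g)
    r = g - matvec(p)                  # residual
    direction = r.copy()
    rs_old = r @ r
    for _ in range(q):                 # at most q CG iterations
        if np.sqrt(rs_old) <= tol * np.linalg.norm(g):
            break
        Hd = matvec(direction)
        alpha = rs_old / (direction @ Hd)
        p += alpha * direction
        r -= alpha * Hd
        rs_new = r @ r
        direction = r + (rs_new / rs_old) * direction
        rs_old = rs_new
    return p

rng = np.random.default_rng(0)
A_sub = rng.standard_normal((100, 15))   # s = 100 sampled rows, d = 15
g = rng.standard_normal(15)
p_cg = cg_subproblem(A_sub, g, gamma=0.1, q=200)
p_exact = np.linalg.solve(A_sub.T @ A_sub / 100 + 0.1 * np.eye(15), g)
```

If `A_sub` is stored as a sparse matrix, the same loop costs time proportional to its number of non-zeros per iteration, which is where the ${\mathcal{O}}(q\cdot\mathsf{nnz}({\bf A}))$ bound comes from.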

Let $\kappa_{t}$ be the condition number of $\widetilde{\bf H}_{t}$ . For smooth problems, Shusen Wang et al. (2018) showed that by performing

[TABLE]

CG iterations, the conditions (18) and (19) are satisfied, and the inexact solution does not much affect the convergence of SSN and GIANT.

Corollary 7 (SSN).

*Let $\widetilde{\bf p}_{t}$ and $\widetilde{\bf p}_{t}^{\prime}$ be respectively the exact and an inexact solution to the linear system $\widetilde{\bf H}_{t}{\bf p}={\bf g}_{t}$ . SSN updates ${\bf w}$ by ${\bf w}_{t+1}={\bf w}_{t}-\tilde{{\bf p}}_{t}^{\prime}$ . If the condition *

[TABLE]

is satisfied for some $\varepsilon_{0}\in(0,1)$ , then Theorems 1 and 2, with $\varepsilon$ in (12) and (13) replaced by $\varepsilon+\varepsilon_{0}$ , continue to hold.

Proof.

We prove the corollary in Appendix E.1. ∎

Corollary 8 (GIANT).

*Let $\tilde{{\bf p}}_{t,i}$ and $\tilde{{\bf p}}_{t,i}^{\prime}$ be respectively the exact and an inexact solution to the linear system $\widetilde{\bf H}_{t,i}{\bf p}={\bf g}_{t}$ . GIANT updates ${\bf w}$ by ${\bf w}_{t+1}={\bf w}_{t}-\frac{1}{m}\sum_{i=1}^{m}\widetilde{\bf p}_{t,i}^{\prime}$ . If the condition *

[TABLE]

is satisfied for some $\varepsilon_{0}\in(0,1)$ and all $i\in[m]$ , then Theorems 4 and 5, with $\varepsilon$ in (15) and (16) replaced by $\varepsilon+\varepsilon_{0}$ , continue to hold.

Proof.

The corollary can be proved in almost the same way as in Shusen Wang et al. (2018), so we do not repeat the proof. ∎

SSPN is designed for problems with non-smooth regularization, in which case finding the exact solution may be infeasible, and the sub-problem can only be inexactly solved. If the inexact solution satisfies the same condition (18), Corollary 9 guarantees the convergence rate of SSPN.

Corollary 9 (SSPN).

Let $\tilde{{\bf p}}_{t}$ and $\tilde{{\bf p}}_{t}^{\prime}$ be respectively the exact and an inexact solution to the non-smooth problem (17). SSPN updates ${\bf w}$ by ${\bf w}_{t+1}={\bf w}_{t}-\tilde{{\bf p}}_{t}^{\prime}$ . If $\widetilde{\bf p}_{t}^{\prime}$ satisfies the condition (18) for some $\varepsilon_{0}\in(0,1)$ , then Theorem 6 still holds for SSPN with $\varepsilon$ replaced by $\varepsilon+\varepsilon_{0}$ .

Proof.

We prove the corollary in Appendix E.2. ∎

7 Conclusion

We studied the subsampled Newton (SSN) method and its variants, GIANT and SSPN, and established stronger convergence guarantees than the prior works. In particular, we showed that a sample size of $s=\tilde{\Theta}({d_{\text{eff}}^{\gamma}})$ suffices, where $\gamma$ is the $\ell_{2}$ regularization parameter and ${d_{\text{eff}}^{\gamma}}$ is the effective dimension. When $n\gamma$ is larger than most of the eigenvalues of the Hessian matrices, ${d_{\text{eff}}^{\gamma}}$ is much smaller than the dimension of data, $d$ . Therefore, our work guarantees the convergence of SSN, GIANT, and SSPN on high-dimensional data where $d$ is comparable to or even greater than $n$ . In contrast, all the prior works required a conservative sample size $s=\Omega(d)$ to attain the same convergence rate as ours. Because subsampling means that $s$ is much smaller than $n$ , the prior works did not lend any guarantee to SSN on high-dimensional data.

Appendix A Random Sampling for Matrix Approximation

Here, we give a short introduction to random sampling and its theoretical properties. Given a matrix ${\bf A}\in{\mathbb{R}}^{n\times d}$ , row selection constructs a smaller matrix ${\bf C}\in{\mathbb{R}}^{s\times d}$ ( $s<n$ ) as an approximation of ${\bf A}$ . The rows of ${\bf C}$ are constructed using a randomly sampled and carefully scaled subset of the rows of ${\bf A}$ . Let $p_{1},\cdots,p_{n}\in(0,1)$ be the sampling probabilities associated with the rows of ${\bf A}$ . The rows of ${\bf C}$ are selected independently according to the sampling probabilities $\{p_{j}\}_{j=1}^{n}$ such that for all $j\in[s]$ , we have

[TABLE]

where ${\bf c}_{j}$ and ${\bf a}_{k}$ are the $j$ -th row of ${\bf C}$ and $k$ -th row of ${\bf A}$ . In a matrix multiplication form, ${\bf C}$ can be formed as

[TABLE]

where ${\bf S}\in{\mathbb{R}}^{n\times s}$ is called the sketching matrix. As a result of row selection, there is exactly one non-zero entry in each column of ${\bf S}$ , whose position and value correspond to the sampled row of ${\bf A}$ .

Uniform sampling.

Uniform sampling simply sets all the sampling probabilities equal, i.e., $p_{1}=\cdots=p_{n}=\frac{1}{n}$ . Its corresponding sketching matrix ${\bf S}$ is often called a uniform sampling matrix. The non-zero entry in each column of ${\bf S}$ is the same, i.e., $\sqrt{\frac{n}{s}}$ . If $s$ is sufficiently large,

[TABLE]

is a good approximation to ${\bf H}_{t}$ .
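The uniform sampling matrix can be built explicitly to see what the sketched Hessian amounts to. The sketch below assumes sampling with replacement and assumes the approximate Hessian has the form $\tfrac{1}{n}{\bf A}^{T}{\bf S}{\bf S}^{T}{\bf A}+\gamma{\bf I}_{d}$; the helper name `uniform_sketch` is illustrative.

```python
import numpy as np

def uniform_sketch(n, s, rng):
    """n x s sketching matrix S: column j has a single non-zero, sqrt(n/s),
    in the row index sampled (with replacement) for the j-th draw."""
    rows = rng.integers(0, n, size=s)
    S = np.zeros((n, s))
    S[rows, np.arange(s)] = np.sqrt(n / s)
    return S, rows

rng = np.random.default_rng(0)
n, d, s, gamma = 1000, 8, 200, 1e-2
A = rng.standard_normal((n, d))
S, rows = uniform_sketch(n, s, rng)

# Sketched Hessian written with the sketching matrix ...
H_sketch = A.T @ S @ S.T @ A / n + gamma * np.eye(d)
# ... and written directly as an average over the s sampled rows.
H_rows = A[rows].T @ A[rows] / s + gamma * np.eye(d)
```

The two expressions agree, so in practice one never needs to materialize ${\bf S}$; indexing the sampled rows is enough.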

Lemma 10 (Uniform Sampling).

*Let ${\bf H}_{t}$ and $\widetilde{\bf H}_{t}$ be defined as in (10) and (20). Denote ${d_{\text{eff}}^{\gamma}}={d_{\text{eff}}^{\gamma}}({\bf A}_{t}),\mu^{\gamma}=\mu^{\gamma}({\bf A}_{t})$ for simplicity. Given arbitrary error tolerance $\varepsilon\in(0,1)$ and failure probability $\delta\in(0,1)$ , when *

[TABLE]

*the spectral approximation holds with probability at least $1-\delta$ : *

[TABLE]

Proof.

The proof trivially follows from Cohen et al. (2015). ∎
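The spectral approximation property in Lemma 10 can be measured empirically. The sketch below computes the smallest $\varepsilon$ for which $(1-\varepsilon){\bf H}\preceq\widetilde{\bf H}\preceq(1+\varepsilon){\bf H}$ holds, using the eigenvalues of ${\bf H}^{-1/2}\widetilde{\bf H}{\bf H}^{-1/2}$; the function name `spectral_error` and the synthetic Gaussian data are assumptions made for illustration.

```python
import numpy as np

def spectral_error(H, H_tilde):
    """Smallest eps with (1 - eps) H <= H_tilde <= (1 + eps) H in the PSD order,
    computed from the eigenvalues of H^{-1/2} H_tilde H^{-1/2}."""
    evals, evecs = np.linalg.eigh(H)
    H_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    M = H_inv_half @ H_tilde @ H_inv_half
    M = (M + M.T) / 2                      # symmetrize against round-off
    return float(np.max(np.abs(np.linalg.eigvalsh(M) - 1.0)))

rng = np.random.default_rng(0)
n, d, gamma = 2000, 10, 1e-2
A = rng.standard_normal((n, d))
H = A.T @ A / n + gamma * np.eye(d)

for s in (50, 200, 800):
    rows = rng.integers(0, n, size=s)      # uniform sampling with replacement
    H_tilde = A[rows].T @ A[rows] / s + gamma * np.eye(d)
    eps = spectral_error(H, H_tilde)
    print(f"s = {s:4d}  measured eps = {eps:.3f}")
```

The measured $\varepsilon$ is a property of one random draw, so it fluctuates; the lemma bounds it with probability at least $1-\delta$.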

Ridge leverage score sampling.

It takes $p_{j}$ proportional to the $j$ -th ridge leverage score, i.e.,

[TABLE]

where $l_{i}^{\gamma}$ is the ridge leverage score of the $i$ -th row of ${\bf A}$ . Let ${\bf U}$ be its sketching matrix. Then the non-zero entry in the $j$ -th column of ${\bf U}$ is $\sqrt{\frac{1}{s\cdot p_{k}}}$ if the $j$ -th row of ${\bf U}^{T}{\bf A}$ is drawn from the $k$ -th row of ${\bf A}$ , where $p_{k}$ is defined in (21). If ridge leverage score sampling is used to approximate the $d\times d$ Hessian matrix, the approximate Hessian matrix becomes

[TABLE]

The sample complexity in Theorems 1 and 2 will be improved to $s=\Theta\big{(}\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\big{)}$ .
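Ridge leverage score sampling can be sketched as follows. The code assumes the common definition $l_{j}^{\gamma}={\bf a}_{j}^{T}({\bf A}^{T}{\bf A}+n\gamma{\bf I}_{d})^{-1}{\bf a}_{j}$ with ${d_{\text{eff}}^{\gamma}}=\sum_{j}l_{j}^{\gamma}$; the precise definition is given earlier in the paper and is not restated in this appendix, so treat the normalization by $n\gamma$ as an assumption. Function names are illustrative, and the scores are computed exactly here, which is only practical for small problems.

```python
import numpy as np

def ridge_leverage_scores(A, gamma):
    """l_j = a_j^T (A^T A + n*gamma*I)^{-1} a_j for every row a_j of A."""
    n, d = A.shape
    G_inv = np.linalg.inv(A.T @ A + n * gamma * np.eye(d))
    return np.einsum("ij,jk,ik->i", A, G_inv, A)

def ridge_leverage_sample(A, gamma, s, rng):
    """Draw s rows with probability proportional to their ridge leverage
    scores and rescale each drawn row by 1 / sqrt(s * p_k)."""
    scores = ridge_leverage_scores(A, gamma)
    probs = scores / scores.sum()
    rows = rng.choice(A.shape[0], size=s, p=probs)
    scale = 1.0 / np.sqrt(s * probs[rows])
    return scale[:, None] * A[rows], scores

rng = np.random.default_rng(0)
n, d, gamma, s = 1000, 50, 0.1, 100
A = rng.standard_normal((n, d))
C, scores = ridge_leverage_sample(A, gamma, s, rng)

d_eff = scores.sum()                             # effective dimension
H_tilde = C.T @ C / n + gamma * np.eye(d)        # sketched Hessian
print("d =", d, " d_eff =", round(float(d_eff), 2))
```

Under this definition ${d_{\text{eff}}^{\gamma}}=\sum_{k}\sigma_{k}^{2}/(\sigma_{k}^{2}+n\gamma)$, where $\sigma_{k}$ are the singular values of ${\bf A}$, so it is always below $d$ and shrinks as $n\gamma$ grows.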

Lemma 11 (Ridge Leverage Sampling).

Let ${\bf H}_{t}$ and $\widetilde{\bf H}_{t}$ be defined as in (10) and (22). Denote ${d_{\text{eff}}^{\gamma}}={d_{\text{eff}}^{\gamma}}({\bf A}_{t}),\mu^{\gamma}=\mu^{\gamma}({\bf A}_{t})$ for simplicity. Given arbitrary error tolerance $\varepsilon\in(0,1)$ and failure probability $\delta\in(0,1)$ , when

[TABLE]

the spectral approximation holds with probability at least $1-\delta$ :

[TABLE]

Proof.

The proof trivially follows from Cohen et al. (2015). ∎

Appendix B Convergence of Sub-Sampled Newton

In this section, we first give a framework for analyzing the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ , which also inspires the proofs for the distributed Newton-type method and SSPN. Within this simple framework, we then complete the proofs for the global and local convergence for SSN.

B.1 An analysis framework

Approximate Newton Direction.

We can view the process of solving for the Newton direction ${\bf p}_{t}$ from the linear system

[TABLE]

as a convex optimization problem. Recalling that ${\bf A}_{t}$ is defined in (9), we define the quadratic auxiliary function

[TABLE]

Obviously, the true Newton direction ${\bf p}_{t}$ is the critical point of $\phi_{t}({\bf p})$ :

[TABLE]

Since we use the subsampled Hessian $\widetilde{\bf H}_{t}$ , we solve for the approximate Newton direction $\widetilde{\bf p}_{t}$ from (11) instead of (23); thus the counterpart of $\phi_{t}({\bf p})$ is defined as

[TABLE]

It is easy to verify that the approximate Newton direction $\widetilde{\bf p}_{t}$ is the minimizer of (26), i.e.,

[TABLE]

Lemma 12 shows that $\widetilde{\bf p}_{t}$ is close to ${\bf p}_{t}$ in terms of the value of $\phi(\cdot)$ , provided that the subsampled Hessian $\widetilde{\bf H}_{t}$ , which defines the linear system that $\widetilde{\bf p}_{t}$ satisfies, is a good spectral approximation of ${\bf H}_{t}$ .

Lemma 12 (Approximate Newton Direction).

*Assume $(1-\varepsilon){\bf H}_{t}\preceq\widetilde{\bf H}_{t}\preceq(1+\varepsilon){\bf H}_{t}$ holds already. Let $\phi_{t}({\bf p})$ , ${\bf p}_{t}^{*}$ and $\widetilde{\bf p}_{t}$ be defined respectively in (24), (25) and (27). It holds that *

[TABLE]

where $\alpha=\frac{\varepsilon}{1-\varepsilon}$ .

Proof.

Since we now analyze the approximate Newton direction locally, we leave out all subscript $t$ for simplicity. By the assumption that $(1-\varepsilon){\bf H}\preccurlyeq\widetilde{\bf H}\preccurlyeq(1+\varepsilon){\bf H}$ , we conclude that there must exist a symmetric matrix $\Gamma$ such that

[TABLE]

By the definition of ${\bf p}^{*}$ and $\tilde{{\bf p}}$ , we have

[TABLE]

where the second equation is the result of ${\bf A}^{-1}-{\bf B}^{-1}={\bf B}^{-1}({\bf B}-{\bf A}){\bf A}^{-1}$ for nonsingular matrices ${\bf A}$ and ${\bf B}$ . The last equation holds since ${\bf H}^{\frac{1}{2}}{\bf p}^{*}={\bf H}^{-\frac{1}{2}}{\bf g}$ .
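For completeness, the matrix identity used in the second equation can be checked in one line by expanding the right-hand side (here ${\bf A}$ and ${\bf B}$ denote generic nonsingular matrices, not the data matrix):

```latex
{\bf B}^{-1}({\bf B}-{\bf A}){\bf A}^{-1}
  = {\bf B}^{-1}{\bf B}{\bf A}^{-1} - {\bf B}^{-1}{\bf A}{\bf A}^{-1}
  = {\bf A}^{-1} - {\bf B}^{-1}.
```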

It follows that

[TABLE]

where the third inequality follows from $\big{\|}\mbox{\boldmath$ \Gamma $\unboldmath}\big{\|}\leq\frac{\varepsilon}{1-\varepsilon}$ and the last inequality holds due to $\big{\|}\mbox{\boldmath$ \Omega $\unboldmath}\big{\|}\leq\varepsilon$ .

Thus it follows from $\phi({\bf p}^{*})=-\big{\|}{\bf H}^{\frac{1}{2}}{\bf p}^{*}\big{\|}_{2}^{2}$ and the definition of $\phi(\widetilde{\bf p})$ that

[TABLE]

Combining the preceding upper bound with (28), we have that

[TABLE]

where $\alpha=\frac{\varepsilon}{1-\varepsilon}$ . Then the lemma follows. ∎

Approximate Newton Step.

If $\widetilde{\bf p}_{t}$ is very close to ${\bf p}_{t}^{*}$ (in terms of the value of the auxiliary function $\phi_{t}(\cdot)$ ), then the direction $\widetilde{\bf p}_{t}$ , along which the parameter ${\bf w}_{t}$ will descend, is provably a good descent direction. Provided that $\widetilde{\bf p}_{t}$ is a good descent direction, Lemma 13 establishes the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ after one descent step.

Lemma 13 (Approximate Newton Step).

*Let Assumption 1 (i.e., the Hessian matrix is $L$ -Lipschitz) hold. Let $\alpha\in(0,1)$ be any fixed error tolerance. If $\tilde{{\bf p}}_{t}$ satisfies *

[TABLE]

*then $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ satisfies *

[TABLE]

Proof.

See the proof of Lemma 9 in Shusen Wang et al. (2018). ∎

Error Recursion.

By combining all the lemmas above, we can analyze the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ for SSN.

Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}_{t}$ . From Lemma 10, when $s=\Theta\left(\frac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\frac{{d_{\text{eff}}^{\gamma}}}{\delta}\right)$ , $\widetilde{\bf H}_{t}$ is an $\varepsilon$ -spectral approximation of ${\bf H}_{t}$ . By Lemma 12, the approximate Newton direction $\widetilde{\bf p}_{t}$ , solved from the linear system $\widetilde{\bf H}_{t}{\bf p}={\bf g}_{t}$ , is not far from ${\bf p}_{t}$ in terms of the value of $\phi(\cdot)$ with $\alpha=\frac{\varepsilon}{1-\varepsilon}$ . It then follows from Lemma 13 that

[TABLE]

which establishes the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}$ for SSN.

B.2 Proof of SSN for quadratic loss

Proof of Theorem 1.

Since the loss is quadratic, w.l.o.g., let ${\bf H}_{t}\equiv{\bf H}$ and ${\bf A}_{t}\equiv{\bf A}$ . Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Note that $L\equiv 0$ due to the quadratic loss. Let $\alpha=\frac{\varepsilon}{1-\varepsilon}$ , as in Appendix B.1, and $\beta=\frac{\alpha}{\sqrt{1-\alpha^{2}}}$ . Since $\varepsilon\leq\frac{1}{4}$ , we have $\beta\leq\sqrt{2}\varepsilon$ .

From the last part of the analysis in B.1, $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ satisfies the error recursion inequality (30) with $L=0$ , i.e.,

[TABLE]

By recursion, it follows that

[TABLE]

Then the theorem follows. ∎
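The linear convergence in Theorem 1 can be observed on a small ridge regression problem. The following is a minimal simulation sketch, assuming synthetic Gaussian data, uniform sampling with replacement, and a fresh sub-sample at every iteration; none of the constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, gamma = 5000, 20, 500, 1e-2
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Objective: (1/n) sum_j 0.5 (x_j^T w - y_j)^2 + (gamma/2) ||w||^2
H = A.T @ A / n + gamma * np.eye(d)
w_star = np.linalg.solve(H, A.T @ y / n)        # exact minimizer

w = np.zeros(d)
errors = [np.linalg.norm(w - w_star)]
for t in range(10):
    g = A.T @ (A @ w - y) / n + gamma * w       # full gradient
    rows = rng.integers(0, n, size=s)           # fresh uniform sub-sample
    H_tilde = A[rows].T @ A[rows] / s + gamma * np.eye(d)
    w = w - np.linalg.solve(H_tilde, g)         # SSN update
    errors.append(np.linalg.norm(w - w_star))

for t, e in enumerate(errors):
    print(f"iteration {t:2d}  ||w_t - w*||_2 = {e:.3e}")
```

Each iteration uses only $s=n/10$ rows to form the Hessian while the gradient remains exact, which is the setting analyzed above.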

B.3 Proof of SSN for non-quadratic loss

Proof of Theorem 2.

Let ${d_{\text{eff}}^{\gamma}}={d_{\text{eff}}^{\gamma}}({\bf A}_{t}),\mu^{\gamma}=\mu^{\gamma}({\bf A}_{t})$ , $\alpha=\frac{\varepsilon}{1-\varepsilon}$ and $\beta=\frac{\alpha}{\sqrt{1-\alpha^{2}}}$ . Since $\varepsilon\leq\frac{1}{4}$ , then $\beta\leq\sqrt{2}\varepsilon$ .
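The bound $\beta\leq\sqrt{2}\varepsilon$ follows from a short calculation, spelled out here as a worked step using only the definitions of $\alpha$ and $\beta$ above:

```latex
\varepsilon \le \tfrac{1}{4}
\;\Longrightarrow\;
\alpha = \frac{\varepsilon}{1-\varepsilon} \le \frac{4}{3}\,\varepsilon \le \frac{1}{3},
\qquad
\sqrt{1-\alpha^{2}} \ge \sqrt{\tfrac{8}{9}} = \frac{2\sqrt{2}}{3},
\qquad
\beta = \frac{\alpha}{\sqrt{1-\alpha^{2}}}
      \le \frac{4}{3}\,\varepsilon \cdot \frac{3}{2\sqrt{2}}
      = \sqrt{2}\,\varepsilon .
```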

From the last part of the analysis in B.1, $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ satisfies the error recursion inequality (30), i.e.,

[TABLE]

Let $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ be the condition number. By plugging ${\bf H}_{t}\preceq\sigma_{\max}({\bf H}_{t}){\bf I}_{d}$ and $\sigma_{\min}({\bf H}_{t}){\bf I}_{d}\preceq{\bf H}_{t}$ into (30), it follows that

[TABLE]

Viewing (31) as a one-variable quadratic inequality about $\big{\|}\mbox{\boldmath$ \Delta $\unboldmath}_{t+1}\big{\|}_{2}$ and solving it, we have

[TABLE]

where the second inequality follows from $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ for $a,b\geq 0$ and the last inequality holds by reorganizations and the fact $\beta\leq\sqrt{2}\varepsilon$ . Then the theorem follows. ∎

B.4 Proof of Theorem 3

Theorem 3 can be proved in the same way as Theorems 1 and 2; the only difference is using Lemma 11 instead of Lemma 10.

Appendix C Convergence of GIANT

We still use the framework described in Appendix B.1 to prove the results for GIANT, but two modifications should be made in the proof. The first one lies in the analysis of Uniform Sampling, since data are distributed and only accessible locally, and subsampled Hessians are constructed locally. We can prove that each worker machine can simultaneously obtain an $\varepsilon$ -spectral approximation of the Hessian matrix. The second one lies in the analysis of Approximate Newton Direction, since GIANT uses the global Newton direction, which is the average of all local ANT directions, to update parameters. We can prove that the global Newton direction is still a good descent direction. Once these two modifications are established, we prove the main theorems for GIANT.

C.1 Two modifications

Simultaneous Uniform Sampling.

We can assume that the $s$ samples on each worker machine are randomly drawn from $\{({\bf x}_{i},l_{i})\}_{i=1}^{n}$ . This assumption is reasonable because if the samples are drawn i.i.d. from some distribution, then a data-independent partition is equivalent to uniform sampling.

Recall that ${\bf A}_{t,i}\in{\mathbb{R}}^{s\times d}$ contains the rows of ${\bf A}_{t}$ selected by the $i$ -th worker machine in iteration $t$ . Let ${\bf S}_{i}\in{\mathbb{R}}^{n\times s}$ be the associated uniform sampling matrix, each column of which has exactly one non-zero entry $\sqrt{\frac{n}{s}}$ . Then ${\bf A}_{t,i}=\sqrt{\frac{s}{n}}{\bf S}_{i}^{T}{\bf A}_{t}$ . The $i$ -th local subsampled Hessian matrix is formed as

[TABLE]

Lemma 14 (Simultaneous Uniform Sampling).

*Let $\varepsilon,\delta\in(0,1)$ be fixed parameters. Let ${\bf H}_{t},\widetilde{\bf H}_{t,i}$ be defined in (10) and (32). Denote ${d_{\text{eff}}^{\gamma}}={d_{\text{eff}}^{\gamma}}({\bf A}_{t}),\mu^{\gamma}=\mu^{\gamma}({\bf A}_{t})$ for simplicity. Then when *

[TABLE]

*with probability at least $1-\delta$ , the spectral approximation holds simultaneously, i.e., *

[TABLE]

Proof.

Since we analyze each local ${\bf A}_{t}$ , we leave out the subscript $t$ for simplicity.

By Lemma 10, when $s=\Theta\left(\frac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\frac{{d_{\text{eff}}^{\gamma}}m}{\delta}\right)$ , it holds with probability at least $1-\frac{\delta}{m}$ that

[TABLE]

By Bonferroni’s method (the union bound over the $m$ worker machines), the spectral approximation holds simultaneously with probability at least $1-\delta$ . ∎

Global Approximate Newton Direction

Recall that the gradient at iteration $t$ is ${\bf g}_{t}=\nabla F({\bf w}_{t})$ . The local ANT computed by $i$ -th worker machine is $\widetilde{\bf p}_{t,i}=\widetilde{\bf H}_{t,i}^{-1}{\bf g}_{t}$ . The global Newton direction is formed as

[TABLE]

where $\widetilde{\bf H}_{t}$ is defined as the harmonic mean of $\widetilde{\bf H}_{t,i}$ , i.e.,

[TABLE]

Lemma 15 (Model average).

Assume condition (33) holds for given $\varepsilon,\delta\in(0,1)$ . Let $\phi_{t}({\bf p})$ and $\widetilde{\bf p}_{t}$ be defined respectively in (24) and (34). It holds that

[TABLE]

where $\alpha=\frac{\varepsilon^{2}}{1-\varepsilon}$ .

Proof.

We leave out all subscript $t$ for simplicity. It follows from condition (33) that there must exist a symmetric matrix $\mbox{\boldmath$ \Gamma $\unboldmath}_{i}$ such that

[TABLE]

By the definition of $\widetilde{\bf p}_{i}$ and ${\bf p}^{*}$ , we have

[TABLE]

where the second equation is the result of ${\bf A}^{-1}-{\bf B}^{-1}={\bf B}^{-1}({\bf B}-{\bf A}){\bf A}^{-1}$ for nonsingular matrices ${\bf A}$ and ${\bf B}$ and the last equation holds since ${\bf H}^{\frac{1}{2}}{\bf p}^{*}={\bf H}^{-\frac{1}{2}}{\bf g}$ .

It follows that

[TABLE]

It follows from the assumption (33) that

[TABLE]

Let ${\bf S}=\frac{1}{\sqrt{m}}[{\bf S}_{1},\cdots,{\bf S}_{m}]$ be the concatenation of ${\bf S}_{1},\cdots,{\bf S}_{m}$ . Then ${\bf S}\in{\mathbb{R}}^{n\times ms}$ is a uniform sampling matrix which samples $n=ms$ rows. Actually, ${\bf S}$ is a permutation matrix, with every row and column containing precisely a single 1 with 0s everywhere else. Therefore,

[TABLE]

It follows from (35), (36) and (37) that

[TABLE]

By the definition of $\phi({\bf p})$ and (38), it follows that

[TABLE]

where $\alpha=\frac{\varepsilon^{2}}{1-\varepsilon}$ . Then the lemma follows from $\phi({\bf p}^{*})=-\big{\|}{\bf H}^{\frac{1}{2}}{\bf p}^{*}\big{\|}_{2}^{2}$ . ∎

Error Recursion.

Plugging the above two modifications into the analysis framework described in Appendix B.1, we can analyze the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ for GIANT.

Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}_{t}$ . From Lemma 14, when $s=\Theta\left(\frac{\mu^{\gamma}{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\frac{{d_{\text{eff}}^{\gamma}}m}{\delta}\right)$ , for each $i\in[m]$ , $\widetilde{\bf H}_{t,i}$ is an $\varepsilon$ -spectral approximation of ${\bf H}_{t}$ . By Lemma 15, the global ANT $\widetilde{\bf p}_{t}$ , an average of all local ANTs $\widetilde{\bf p}_{t,i}$ , is not far from ${\bf p}_{t}$ in terms of the value of $\phi(\cdot)$ with $\alpha=\frac{\varepsilon^{2}}{1-\varepsilon}$ . It then follows from Lemma 13 that (30) still holds but with $\alpha=\frac{\varepsilon^{2}}{1-\varepsilon}$ , i.e.,

[TABLE]

which establishes the recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}$ for GIANT.

C.2 Proof of GIANT for quadratic loss

Proof of Theorem 4.

Since the loss is quadratic, w.l.o.g, let ${\bf H}_{t}\equiv{\bf H}$ and ${\bf A}_{t}\equiv{\bf A}$ . Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Note that $L\equiv 0$ due to the quadratic loss. Let $\beta=\frac{\alpha}{\sqrt{1-\alpha^{2}}}$ . Since $\varepsilon\leq\frac{1}{2}$ , then $\beta\leq 3\varepsilon^{2}$ .

From the last part of the analysis in C.1, $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ satisfies the error recursion inequality (30) with $L=0$ , i.e.,

[TABLE]

By recursion, it follows that

[TABLE]

Then the theorem follows. ∎

C.3 Proof of GIANT for non-quadratic loss

Proof of Theorem 5.

Let ${d_{\text{eff}}^{\gamma}}={d_{\text{eff}}^{\gamma}}({\bf A}_{t}),\mu^{\gamma}=\mu^{\gamma}({\bf A}_{t})$ , $\alpha=\frac{\varepsilon^{2}}{1-\varepsilon}$ and $\beta=\frac{\alpha}{\sqrt{1-\alpha^{2}}}$ . Since $\varepsilon\leq\frac{1}{2}$ , then $\beta\leq 3\varepsilon^{2}$ .

From the last part of the analysis in C.1, $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ satisfies the error recursion inequality (30), i.e.,

[TABLE]

Let $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ be the condition number. By plugging ${\bf H}_{t}\preceq\sigma_{\max}({\bf H}_{t}){\bf I}_{d}$ and $\sigma_{\min}({\bf H}_{t}){\bf I}_{d}\preceq{\bf H}_{t}$ into (30), we can obtain a one-variable quadratic inequality about $\big{\|}\mbox{\boldmath$ \Delta $\unboldmath}_{t+1}\big{\|}_{2}$ , which has almost the same form as (31) except for the value of $\beta$ . Solving it, we have

[TABLE]

Then the theorem follows from $\beta\leq 3\varepsilon^{2}$ . ∎

Appendix D Convergence of SSPN

Since the proximal mapping of the non-smooth part $r(\cdot)$ is used, rather than direct gradient descent, the analysis of Approximate Newton Step should be modified. We first introduce two properties of the proximal mapping, and then provide the error recursion of $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ for SSPN. Based on that error recursion, we can prove the global convergence for quadratic loss and the local convergence for non-quadratic loss for SSPN.

D.1 Proximal mapping

The definition of the proximal operator is needed only for the theoretical analysis, so we place it in the appendix. The proximal mapping is defined as

[TABLE]

which involves the current point ${\bf w}$ , the convex (perhaps non-smooth) function $r(\cdot)$ , and the SPSD preconditioning matrix ${\bf Q}$ . The update rule of SSPN is:

[TABLE]

SSN can also be written in this form, with $r(\cdot)=0$ .

The proximal mapping enjoys the nonexpansiveness property and fixed point property Lee et al. (2014).

Lemma 16 (Nonexpansiveness).

*Let $r:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ be a convex function and ${\bf H}\in{\mathbb{R}}^{d\times d}$ be an SPSD matrix. The scaled proximal mapping is nonexpansive, i.e., for all ${\bf w}_{1}$ and ${\bf w}_{2}$ , *

[TABLE]
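Nonexpansiveness can be checked numerically in a simple special case. The sketch below assumes $r({\bf w})=\lambda\|{\bf w}\|_{1}$ and a diagonal, positive-definite scaling matrix, for which the scaled proximal mapping has a closed form; it verifies the standard consequence that distances measured in the scaled norm do not grow. The function names are hypothetical and the check is an illustration, not a restatement of the lemma.

```python
import numpy as np

def scaled_prox_l1(w, q_diag, lam):
    """argmin_z 0.5 * (z - w)^T diag(q) (z - w) + lam * ||z||_1.
    The problem is separable, so the solution is soft-thresholding with
    per-coordinate threshold lam / q_i."""
    tau = lam / q_diag
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def q_norm(v, q_diag):
    """Norm induced by the diagonal matrix diag(q)."""
    return float(np.sqrt(np.sum(q_diag * v * v)))

rng = np.random.default_rng(0)
d, lam = 30, 0.5
q_diag = rng.uniform(0.1, 5.0, size=d)          # positive diagonal scaling

ratios = []
for _ in range(1000):
    w1, w2 = rng.standard_normal(d), rng.standard_normal(d)
    lhs = q_norm(scaled_prox_l1(w1, q_diag, lam) - scaled_prox_l1(w2, q_diag, lam), q_diag)
    rhs = q_norm(w1 - w2, q_diag)
    ratios.append(lhs / rhs)
print("largest observed ratio:", max(ratios))
```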

Lemma 17 (Fixed Point Property of Minimizers).

*Let $L:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ be convex and twice differentiable and $r:{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ be convex. Let ${\bf g}^{*}$ be the gradient of $L$ at ${\bf w}^{*}$ , i.e., ${\bf g}^{*}=\nabla L({\bf w}^{*})$ . Then for any SPSD matrix ${\bf H}$ , ${\bf w}^{*}$ minimizes $F({\bf w})=L({\bf w})+r({\bf w})$ if and only if *

[TABLE]

D.2 Analysis of Approximate Newton Step

Lemma 18 (Error Recursion).

*Assume $(1-\varepsilon){\bf H}_{t}\preceq\widetilde{\bf H}_{t}\preceq(1+\varepsilon){\bf H}_{t}$ and Assumption 1 (i.e., the Hessian Lipschitz continuity) hold. Let $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ , we have *

[TABLE]

Proof.

Recall that the updating rule is

[TABLE]

Let ${\bf w}^{*}$ be the minimizer and ${\bf g}^{*}=\nabla F({\bf w}^{*})$ be the gradient of the smooth part of the objective (i.e., $F({\bf w})=\frac{1}{n}\sum_{j=1}^{n}l_{j}({\bf x}_{j}^{T}{\bf w})+\frac{\gamma}{2}\big{\|}{\bf w}\big{\|}_{2}^{2}$ ) at ${\bf w}^{*}$ . It follows that

[TABLE]

where the first equality results from Lemma 17 and the first inequality results from Lemma 16.

Let $\mbox{\boldmath$ \Delta $\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}$ and ${\bf Z}_{t}={\bf H}_{t}\mbox{\boldmath$ \Delta $\unboldmath}_{t}-({\bf g}_{t}-{\bf g}^{*})$ for short. Then

[TABLE]

The first inequality is a rephrasing of (40). The second inequality follows from the triangle inequality and the definition of ${\bf Z}_{t}$ . The third inequality holds due to $\widetilde{\bf H}_{t}^{-1}\preceq\frac{1}{1-\varepsilon}{\bf H}_{t}^{-1}$ and $\widetilde{\bf H}_{t}-{\bf H}_{t}\preceq\varepsilon{\bf H}_{t}$ .

Next we upper bound $\big{\|}{\bf Z}_{t}\big{\|}_{2}$ . It follows that

[TABLE]

The second line holds due to

[TABLE]

The third line follows from Cauchy inequality and the definition ${\bf H}_{t}={\bf H}({\bf w}_{t})$ . The last inequality holds due to the Hessian Lipschitz continuity (Assumption 1).

Note that $\big{\|}\mbox{\boldmath$ \Delta $\unboldmath}_{t+1}\big{\|}_{{\bf H}_{t}}\leq\frac{1}{\sqrt{1-\varepsilon}}\big{\|}\mbox{\boldmath$ \Delta $\unboldmath}_{t+1}\big{\|}_{\widetilde{\bf H}_{t}}.$ Thus the lemma follows from this inequality, (41) and (42), i.e.,

[TABLE]

∎

D.3 Proof of SSPN for quadratic loss

Theorem 19 (Formal statement of Theorem 6 for quadratic loss).

*Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Assume each loss function $l_{i}(\cdot)$ is quadratic. For a sufficiently large sub-sample size: *

[TABLE]

*with probability at least $1-\delta$ , *

[TABLE]

Proof of Theorem 19.

Under the same condition as Theorem 1, by Lemma 18, it follows that

[TABLE]

Since the loss is quadratic then $L\equiv 0$ . Let ${\bf H}_{t}\equiv{\bf H}$ and ${\bf A}_{t}\equiv{\bf A}$ . Let ${d_{\text{eff}}^{\gamma}}$ and $\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}$ , and $\kappa$ be the condition number of ${\bf H}$ . Then we have

[TABLE]

By recursion, it follows that

[TABLE]

where $\beta=\frac{\varepsilon}{1-\varepsilon}$ and $\kappa=\frac{\sigma_{\max}({\bf H})}{\sigma_{\min}({\bf H})}$ is the condition number. ∎

D.4 Proof of SSPN for non-quadratic loss

Theorem 20 (Formal statement of Theorem 6 for non-quadratic loss).

*Let ${d_{\text{eff}}^{\gamma}},\mu^{\gamma}$ respectively be the $\gamma$ -ridge leverage score and $\gamma$ -coherence of ${\bf A}_{t}$ . Let $\varepsilon\in(0,\frac{1}{4})$ and $\delta\in(0,1)$ be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size: *

[TABLE]

*with probability at least $1-\delta$ , *

[TABLE]

where $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ is the condition number.

Proof of Theorem 20.

By plugging $\sigma_{\min}({\bf H}_{t}){\bf I}_{d}\preceq{\bf H}_{t}\preceq\sigma_{\max}({\bf H}_{t}){\bf I}_{d}$ into the result of Lemma 18, it follows that

[TABLE]

where $\kappa_{t}=\tfrac{\sigma_{\max}({\bf H}_{t})}{\sigma_{\min}({\bf H}_{t})}$ is the condition number.

Since $\varepsilon\leq\frac{1}{2}$ , $\frac{1}{1-\varepsilon}$ is bounded by 2. Thus we have

[TABLE]

which proves this theorem. ∎

Appendix E Inexact Solution to Sub-Problems

The computational cost can be reduced when the conjugate gradient (CG) method is used to compute an inexact Newton step. This approach has been discussed and practiced before in Shusen Wang et al. (2018). In this section, we prove that SSN and GIANT can benefit from inexact Newton steps. Moreover, we provide a theoretical bound for the inexact solution for SSPN.

The framework described in Appendix B.1 can help us to prove results for SSN and GIANT. Since CG produces an approximate solution for the linear system which the approximate Newton direction satisfies, the analysis of Approximate Newton Direction in Appendix B.1 should be modified. We can prove that when the inexact solution satisfies the particular stopping condition, it is close to the exact Newton direction ${\bf p}_{t}$ in terms of the value of $\phi_{t}(\cdot)$ .

E.1 Inexactly solving for SSN

In the $t$ -th iteration, the exact solution is $\widetilde{\bf p}_{t}=\widetilde{\bf H}_{t}^{-1}{\bf g}_{t}$ , where $\widetilde{\bf H}_{t}$ is the subsampled Hessian defined in (10). Let $\widetilde{\bf p}_{t}^{\prime}$ be the inexact solution produced by CG. It satisfies the stopping condition (18), i.e.,

[TABLE]

Thus SSN uses the inexact Newton direction $\widetilde{\bf p}_{t}^{\prime}$, instead of $\widetilde{\bf p}_{t}$, to update the parameter ${\bf w}_{t}$.
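The following is a minimal sketch of how CG might produce such an inexact direction. The stopping condition (18) is not restated in this section, so the sketch stops on the relative residual as a practical stand-in and then reports the relative error in the $\widetilde{\bf H}$-norm against the exact solution; both choices, the synthetic data, the uniform row sampling, and all names (`conjugate_gradient`, `energy_norm_rel_error`, `H_sub`) are assumptions made for illustration.

```python
import numpy as np

def conjugate_gradient(H, g, tol=1e-3, max_iter=1000):
    """Approximately solve H p = g for a symmetric positive definite H.

    Stops once ||H p - g|| <= tol * ||g||. The relative residual is a proxy
    chosen for this sketch, not the paper's stopping condition (18).
    """
    p = np.zeros_like(g)
    r = g - H @ p              # current residual
    d = r.copy()               # current search direction
    rs_old = r @ r
    g_norm = np.linalg.norm(g)
    for _ in range(max_iter):
        if np.sqrt(rs_old) <= tol * g_norm:
            break
        Hd = H @ d
        step = rs_old / (d @ Hd)
        p = p + step * d
        r = r - step * Hd
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return p

def energy_norm_rel_error(H, p_inexact, p_exact):
    """Return ||H^{1/2}(p' - p)||_2 / ||H^{1/2} p||_2."""
    diff = p_inexact - p_exact
    return np.sqrt(diff @ H @ diff) / np.sqrt(p_exact @ H @ p_exact)

# Hypothetical ridge-regularized quadratic problem with a subsampled Hessian.
rng = np.random.default_rng(0)
n, d, s, gamma = 2000, 50, 400, 1e-1
A = rng.standard_normal((n, d)) / np.sqrt(n)
idx = rng.choice(n, size=s, replace=False)           # uniform row sample
H_sub = (n / s) * A[idx].T @ A[idx] + gamma * np.eye(d)
g = rng.standard_normal(d)

p_exact = np.linalg.solve(H_sub, g)                   # exact direction
p_cg = conjugate_gradient(H_sub, g, tol=1e-3)         # inexact direction
rel_err = energy_norm_rel_error(H_sub, p_cg, p_exact)
print(f"relative error in the H-norm: {rel_err:.2e}")
```

In practice the exact solution is unavailable, which is why a residual-based test is used inside the loop; the energy-norm error is computed here only to check how tight the inexact solve is.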

Lemma 21 (Inexact solution for SSN).

For a given $\varepsilon\in(0,1)$, assume $(1-\varepsilon){\bf H}_{t}\preceq\widetilde{\bf H}_{t}\preceq(1+\varepsilon){\bf H}_{t}$ holds. Let $\phi_{t}({\bf p})$ be as defined in (24). Let $\widetilde{\bf p}_{t}^{\prime}$ be the inexact solution satisfying (18). Then it holds that

[TABLE]

where $\alpha_{0}=\frac{\varepsilon_{0}+\varepsilon}{1-\varepsilon-\varepsilon_{0}}$ .
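As a rough numerical illustration of the two ingredients of this lemma, the sketch below measures the smallest $\varepsilon$ for which $(1-\varepsilon){\bf H}\preceq\widetilde{\bf H}\preceq(1+\varepsilon){\bf H}$ holds and evaluates $\alpha_{0}$ from the formula above. The helper names `spectral_approx_eps` and `alpha0`, the synthetic data, and the uniform sampling are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def spectral_approx_eps(H, H_tilde):
    """Smallest eps such that (1 - eps) H <= H_tilde <= (1 + eps) H."""
    # Build H^{-1/2} from the eigendecomposition of the SPD matrix H.
    evals, evecs = np.linalg.eigh(H)
    H_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    # The eigenvalues of H^{-1/2} H_tilde H^{-1/2} must lie in [1 - eps, 1 + eps].
    M = H_inv_sqrt @ H_tilde @ H_inv_sqrt
    M = (M + M.T) / 2          # symmetrize against round-off
    lam = np.linalg.eigvalsh(M)
    return max(1.0 - lam[0], lam[-1] - 1.0)

def alpha0(eps, eps0):
    """alpha_0 = (eps_0 + eps) / (1 - eps - eps_0), as stated in Lemma 21."""
    if eps + eps0 >= 1.0:
        raise ValueError("need eps + eps0 < 1")
    return (eps0 + eps) / (1.0 - eps - eps0)

# Hypothetical example: uniform row sampling for a ridge-regularized quadratic.
rng = np.random.default_rng(1)
n, d, s, gamma = 5000, 20, 1000, 1e-1
A = rng.standard_normal((n, d)) / np.sqrt(n)
H = A.T @ A + gamma * np.eye(d)
idx = rng.choice(n, size=s, replace=False)
H_tilde = (n / s) * A[idx].T @ A[idx] + gamma * np.eye(d)

eps = spectral_approx_eps(H, H_tilde)
print(f"measured eps = {eps:.3f}")
if eps + 0.05 < 1.0:
    print(f"alpha_0 with eps_0 = 0.05: {alpha0(eps, 0.05):.3f}")
```

The formula shows that $\alpha_{0}$ grows without bound as $\varepsilon+\varepsilon_{0}$ approaches 1, so both the sampling error and the CG tolerance need to stay well below that threshold.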

Proof.

We leave out the subscript $t$ for simplicity.

It follows from the stopping condition (18) that

[TABLE]

Since ${\bf H}^{\frac{1}{2}}{\bf p}^{*}={\bf H}^{-\frac{1}{2}}{\bf g}$ , it follows that

[TABLE]

Then it follows that

[TABLE]

where the last inequality is due to (28) (which results from Lemma 12).

By the definition of $\phi({\bf p})$ and (43), it follows that

[TABLE]

where $\alpha_{0}=\frac{\varepsilon_{0}+\varepsilon}{1-\varepsilon-\varepsilon_{0}}$ . ∎

E.2 Inexactly solving for SSPN

When the inexact solution is used, the update rule of SSPN becomes:

[TABLE]

where $\widetilde{\bf p}_{t}^{\prime}$ is the inexact solution. CG produces $\widetilde{\bf p}_{t}^{\prime}$ via inexactly solving the linear system $\widetilde{\bf H}_{t}{\bf p}={\bf g}_{t}$ with stopping condition (18), which is equivalent to

[TABLE]

Lemma 22 (Inexact solution for SSPN).

*Assume $(1-\varepsilon){\bf H}_{t}\preceq\widetilde{\bf H}_{t}\preceq(1+\varepsilon){\bf H}_{t}$ and Assumption 1 (i.e., the Hessian Lipschitz continuity) hold. Assume $\varepsilon\leq\frac{1}{2}$. Let $\boldsymbol{\Delta}_{t}={\bf w}_{t}-{\bf w}^{*}$. Let $\widetilde{\bf p}_{t}^{\prime}$ be an approximation to the SSPN direction $\widetilde{\bf p}_{t}$ and let ${\bf w}_{t+1}^{\prime}={\bf w}_{t}-\widetilde{\bf p}_{t}^{\prime}$. Then we have*

[TABLE]

where $\varepsilon_{1}=2\varepsilon+\varepsilon_{0}$ .

Proof.

By the updating rules ${\bf w}_{t+1}={\bf w}_{t}-\widetilde{\bf p}_{t}$ and ${\bf w}_{t+1}^{\prime}={\bf w}_{t}-\widetilde{\bf p}_{t}^{\prime}$ , we obtain

[TABLE]

It follows from (18) that

[TABLE]

and thus

[TABLE]

where $\Delta_{t+1}\triangleq{\bf w}_{t+1}-{\bf w}^{\star}$.

Since $(1-\varepsilon){\bf H}_{t}\preceq\widetilde{\bf H}_{t}\preceq(1+\varepsilon){\bf H}_{t}$ , it follows that

[TABLE]

It follows from the bound on $\Delta_{t+1}$ (Lemma 18) that

[TABLE]

Thus we can obtain

[TABLE]

where $A_{\varepsilon,\varepsilon_{0}}$ and $B_{\varepsilon,\varepsilon_{0}}$ are functions of $\varepsilon$ and $\varepsilon_{0}$ satisfying

[TABLE]

Since $\varepsilon\leq\frac{1}{2}$ , it follows that

[TABLE]

Then the lemma follows. ∎

Proof of Corollary 9.

Similar to Lemma 18, Lemma 22 shows that when the inexact SSPN direction $\widetilde{\bf p}_{t}^{\prime}$ is used, ${\bf w}_{t+1}^{\prime}={\bf w}_{t}-\widetilde{\bf p}_{t}^{\prime}$ still enjoys linear-quadratic convergence. The only difference in their conclusions is that $\varepsilon$ in Lemma 18 is replaced by $\varepsilon_{1}=2\varepsilon+\varepsilon_{0}$ in Lemma 22. In the proofs of Theorems 19 and 20, we can replace Lemma 18 with Lemma 22; the results then still hold for inexactly solving for SSPN, except that $\varepsilon$ is replaced by $\varepsilon_{1}=2\varepsilon+\varepsilon_{0}$.

Since we can always choose $\varepsilon$ in advance, we can use an $\frac{\varepsilon}{2}$ spectral approximation of ${\bf H}_{t}$ (which slightly changes the value of $s$ but does not change its order), so that $\varepsilon_{1}$ becomes $\varepsilon+\varepsilon_{0}$, as Corollary 9 states. This proves Corollary 9. ∎
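Written out, the substitution in this last step is the following one-line calculation, where $\varepsilon^{\prime}$ is notation introduced here only to denote the tighter approximation level:

```latex
\varepsilon^{\prime} = \frac{\varepsilon}{2}
\quad\Longrightarrow\quad
\varepsilon_{1} = 2\varepsilon^{\prime} + \varepsilon_{0}
= 2\cdot\frac{\varepsilon}{2} + \varepsilon_{0}
= \varepsilon + \varepsilon_{0}.
```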

Bibliography

1. Alaoui and Mahoney [2015] Ahmed Alaoui and Michael W Mahoney. Fast Randomized Kernel Ridge Regression with Statistical Guarantees. In Advances in Neural Information Processing Systems (NIPS). 2015.
2. Berahas et al. [2017] Albert S Berahas, Raghu Bollapragada, and Jorge Nocedal. An investigation of Newton-sketch and subsampled Newton methods. arXiv preprint arXiv:1705.06211, 2017.
3. Bubeck [2014] Sébastien Bubeck. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 15, 2014.
4. Byrd et al. [2011] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.
5. Candès and Recht [2009] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
6. Candes et al. [2006] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(8):1207–1223, 2006.
7. Candès et al. [2011] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM), 58(3):11, 2011.
8. Cohen et al. [2015] Michael B Cohen, Cameron Musco, and Christopher Musco. Ridge leverage scores for low-rank approximation. arXiv preprint arXiv:1511.07263, 6, 2015.
