Do Subsampled Newton Methods Work for High-Dimensional Data?
Xiang Li, Shusen Wang, Zhihua Zhang

TL;DR
This paper provides a theoretical justification for the effectiveness of subsampled Newton methods in high-dimensional settings, showing they require significantly fewer samples than previously thought, especially when leveraging ridge leverage scores.
Contribution
It proves that only a small number of samples based on ridge leverage scores are needed for Hessian approximation in high-dimensional data, extending applicability to distributed and non-smooth problems.
Findings
Subsampled Newton methods need only ^ff_ff samples for Hessian approximation.
The method is effective even when data dimension is large and comparable to data size.
Extensions to distributed and non-smooth regularized optimization problems are provided.
Abstract
Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices but using sufficient curvature information. However, previous results require samples to approximate Hessians, where is the dimension of data points, making it less practically feasible for high-dimensional data. The situation is deteriorated when is comparably as large as the number of data points , which requires to take the whole dataset into account, making subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on high dimensional data. Specifically, we prove only samples are needed in the approximation of Hessian matrices, where is the -ridge leverage and can be much smaller than as long as $n\gamma \gg…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods
Do Subsampled Newton Methods Work for High-Dimensional Data?
Xiang Li1
Shusen Wang2
Zhihua Zhang1 1School of Mathematical Sciences, Peking University, China
2Department of Computer Science, Stevens Institute of Technology, USA
[email protected], [email protected], [email protected]
Abstract
Subsampled Newton methods approximate Hessian matrices through subsampling techniques, alleviating the cost of forming Hessian matrices but using sufficient curvature information. However, previous results require samples to approximate Hessians, where is the dimension of data points, making it less practically feasible for high-dimensional data. The situation is deteriorated when is comparably as large as the number of data points , which requires to take the whole dataset into account, making subsampling useless. This paper theoretically justifies the effectiveness of subsampled Newton methods on convex empirical risk minimization with high dimensional data. Specifically, we provably need only samples the approximation of Hessian matrices, where is the -ridge leverage and can be much smaller than as long as . Additionally, we extend this result so that subsampled Newton methods can work for high-dimensional data on both distributed optimization problems and non-smooth regularized problems.
1 Introduction
Let be the feature vectors, is a convex, smooth, and twice differentiable loss function; the response is captured by . In this paper, we study the following optimization problem:
[TABLE]
where is a non-smooth convex function. We first consider the simple case where is zero, i.e.,
[TABLE]
Such a convex optimization problem (2) arises frequently in machining learning Shalev Shwartz and Ben David (2014). For example, in logistic regression, , and in linear regression, . Then we consider the more general case where is non-zero, e.g., LASSO Tibshirani (1996) and elastic net Zou and Hastie (2005).
To solve (2), many first order methods have been proposed. First-order methods solely exploit information in the objective function and its gradient. Accelerated gradient descent Golub and Van Loan (2012); Nesterov (2013); Bubeck (2014), stochastic gradient descent Robbins and Monro (1985), and their variants Lin et al. (2015); Johnson and Zhang (2013); Schmidt et al. are the most popular approaches in practice due to their simplicity and low per-iteration time complexity. As pointed out by Xu et al. (2017), the downsides of first-order methods are the slow convergence to high-precision and the sensitivity to condition number and hyper-parameters.
Second-order methods use not only the gradient but also information in the Hessian matrix in their update. In particular, the Newton’s method, a canonical second-order method, has the following update rule:
[TABLE]
where the gradient is the first derivative of the objective function at , the Hessian is the second derivative at , and is the step size and can be safely set as one under certain conditions. Compared to the first-order methods, Newton’s method requires fewer iterations and are more robust to the hyper-parameter setting, and guaranteed super-linear local convergence to high-precision. However, Newton’s method is slow in practice, as in each iteration many Hessian-vector products are required to solve the inverse problem . Quasi-Newton methods use information from the history of updates to construct Hessian Dennis and Moré (1977). Well-known works include Broyden-Fletcher-Goldfarb-Shanno (BFGS) Wright and Nocedal (1999) and its limited memory version (L-BFGS) Liu and Nocedal (1989), but their convergence rates are not comparable to Newton’s method.
Recent works proposed the Sub-Sampled Newton (SSN) methods to reduce the per-iteration complexity of the Newton’s method Byrd et al. (2011); Pilanci and Wainwright (2015); Roosta Khorasani and Mahoney (2016); Pilanci and Wainwright (2017); Xu et al. (2017); Berahas et al. (2017); Ye et al. (2017). For the particular problem (2), the Hessian matrix can be written in the form
[TABLE]
for some matrix whose -th row is a scaling of . The basic idea of SSN is to sample and scale () rows of to form and approximate by
[TABLE]
The quality of Hessian approximation is guaranteed by random matrix theories Tropp (2015); Woodruff (2014), based on which the convergence rate of SSN is established.
As the second-order methods perform heavy computation in each iteration and converge in a small number of iterations, they have been adapted to solve distributed machine learning aiming at reducing the communication cost Shamir et al. (2014); Mahajan et al. (2015); Zhang and Lin (2015); Reddi et al. (2016); Shusen Wang et al. (2018). In particular, the Globally Improved Approximate NewTon Method (GIANT) method is based on the same idea as SSN and has fast convergence rate.
As well as Newton’s method, SSN is not directly applicable for (1) because the objective function is non-smooth. Following the proximal-Newton method Lee et al. (2014), SSN has been adapted to solve convex optimization with non-smooth regularization Liu et al. (2017). SSN has also been applied to optimize nonconvex problem Xu et al. (2017); Tripuraneni et al. (2017).
1.1 Our contributions
Recall that is the total number of samples, is the number of features, and is the size of the randomly sampled subset. (Obviously , otherwise the subsampling does not speed up computation.) The existing theories of SSN require to be at least . For the big-data setting, i.e., , the existing theories nicely guarantee the convergence of SSN.
However, high-dimensional data is not uncommon at all in machine learning; can be comparable to or even greater than . Thus requiring both and seriously limits the application of SSN. We considers the question:
Do SSN and its variants work for (1) when ?
The empirical studies in Xu et al. (2016, 2017); Shusen Wang et al. (2018) indicate that yes, SSN and its extensions have fast convergence even if is substantially smaller than . However, their empirical observations are not supported by theory.
This work bridges the gap between theory and practice for convex empirical risk minimization. We show it suffices to use uniformly sampled subset to approximate the Hessian, where is the regularization parameter, () is the -effective-dimension of the Hessian matrix, and hides the constant and logarithmic factors. If is larger than most of the eigenvalues of the Hessian, then is tremendously smaller than Cohen et al. (2015). Our theory is applicable to three SSN methods.
- •
In Section 3, we study the convex and smooth problem (2). we show the convergence of the standard SSN with the effective-dimension dependence and improves Xu et al. (2016).
- •
In Section 4, for the same optimization problem (2), we extend the result to the distributed computing setting and improves the bound of GIANT Shusen Wang et al. (2018).
- •
In Section 5, we study a convex but nonsmooth problem (1) and analyze the combination of SSN and proximal-Newton.
In Section 6, we analyse SSN methods when the subproblems are inexactly solved. The proofs of the main theorems are in the appendix.
2 Notation and Preliminary
Basic matrix notation.
Let be the indentity matrix. Let denote the vector norm and denote the matrix spectral norm. Let
[TABLE]
be its singular value decomposition (SVD), with its largest singular value and the smallest (the -th largest). The moore-Penrose inverse of is defined by {\bf A}^{\dagger}={\bf V}\mbox{\boldmath\Sigma\unboldmath}^{-1}{\bf U}^{T}. If a symmetric real matrix has no negative eigenvalues, it is called symmetric positive semidefinite (SPSD). We denote if is SPSD. For the SPD matrice , we define a norm by and its conditional number by .
Ridge leverage scores.
For , its row -ridge leverage score () is defined by
[TABLE]
for . Here and are defined in (5). For , is the standard leverage score used by Drineas et al. (2008); Michael W. Mahoney (2011).
Effective dimension.
The -effective dimension of is defined by
[TABLE]
If is larger than most of the singular values of , then is tremendously smaller than Alaoui and Mahoney (2015); Cohen et al. (2017). In fact, to trade-off the bias and variance, the optimal setting of makes comparable to the top singular values of Hsu et al. (2014); Wang et al. (2018), and thus is small in practice.
Ridge coherence.
The row -ridge coherence of is
[TABLE]
which measures the extent to which the information in the rows concentrates. If has most of its mass in a relatively small number of rows, its -ridge coherence could be high. This concept is necessary for matrix approximation via uniform sampling. It could be imagined that if most information is in a few rows, which means high coherence, then uniform sampling is likely to miss some of the important rows, leading to low approximation quality. When , it coincides with the standard row coherence
[TABLE]
which is widely used to analyze techniques such as compressed sensing Candes et al. (2006), matrix completion Candès and Recht (2009), robust PCA Candès et al. (2011) and so on.
Gradient and Hessian.
For the optimization problem (2), the gradient of at is
[TABLE]
The Hessian matrix at is
[TABLE]
Let and
[TABLE]
In this way, the Hessian matrix can be expressed as
[TABLE]
3 Sub-Sampled Newton (SSN)
In this section, we provide new and stronger convergence guarantees for the SSN methods. For SSN with uniform sampling, we require a subsample size of ; For SSN with ridge leverage score sampling,111We do not describe the ridge leverage score sampling in detail; the readers can refer to Alaoui and Mahoney (2015); Cohen et al. (2015). a smaller sample size, , suffices. Because is typically much smaller than , our new results guarantee convergence when .
3.1 Algorithm description
We set an interger () and uniformly sample items out of to form the subset . In the -th iteration, we form the matrix which contains the rows of indexed by and the full gradient . Then, the approximately Newton direction is computed by solving the linear system
[TABLE]
by either matrix inversion or the conjugate gradient. Finally, is updated by
[TABLE]
where can be set to one or found by line search. In the rest of this section, we only consider .
Most of the computation is performed in solving (11). The only difference between the standard Newton and the SSN methods is replacing by . Compared to Newton’s method, SSN leads to an almost -factor speed up of the per-iteration computation; however, SSN requires more iterations to converge. Nevertheless, to reach a fixed precision, the overall cost of SSN is much lower than Newton’s method.
3.2 Our improved convergence bounds
Improved bound for quadratic loss.
We let be the unique (due to the strong convexity) optimal solution to the problem 1, be the intermediate output of the -th iteration, and \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{\star}. If the loss function of (1) is quadratic, e.g., , the Hessian matrix does not change with the iteration, so we use and instead. Theorem 1 guarantees the global convergence of SSN.
Theorem 1** (Global Convergence).**
*Let and respectively be the -ridge leverage score and -coherence of , and be the condition number of . Let and be any user-specified constants. Assume the loss function of (1) is quadratic. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
Proof.
We prove the theorem in Appendix B.2. ∎
Improved bound for non-quadratic loss.
If the loss function of (1) is non-quadratic, the Hessian matrix changes with iteration, and we can only guarantee fast local convergence, as well as the prior works Roosta Khorasani and Mahoney (2016); Xu et al. (2016). We make a standard assumption on the Hessian matrix, which is required by all the prior works on Newton-type methods.
Assumption 1**.**
The Hessian matrix is -Lipschitz continuous, i.e., , for arbitrary and .
Theorem 2** (Local Convergence).**
*Let respectively be the -ridge leverage score and -coherence of . Let and be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
where is the condition number.
Proof.
We prove the theorem in Appendix B.3. ∎
Theorem 3**.**
*If ridge leverage score sampling is used instead, the sample complexity in Theorems 1 and 2 will be improved to *
[TABLE]
Remark 1**.**
Ridge leverage score sampling eliminates the dependence on the coherence, and the bound is stronger than all the existing sample complexities for SSN. We prove the corollary in Appendix B.4. However, the ridge leverage score sampling is expensive and impractical and thus has only theoretical interest.
Although Newton-type methods empirically demonstrate fast global convergence in almost all the real-world applications, they do not have strong global convergence guarantee. A weak global convergence bound for SSN was established by Roosta Khorasani and Mahoney (2016). We do not further discuss the global convergence issue in this paper.
3.3 Comparison with prior work
For SSN with uniform sampling, the prior work Roosta Khorasani and Mahoney (2016) showed that to obtain the same convergence bounds as ours, (12) and (13), the sample complexity should be
[TABLE]
In comparison, to obtain a same convergence rate, our sample complexity has a better dependence on the condition number and the dimensionality.
For the row norm square sampling of Xu et al. (2016), which is slightly more expensive than uniform sampling, a sample complexity of
[TABLE]
suffices for the same convergence rates as ours, (12) and (13). Their bound may or may not guarantee convergence for . Even if is larger than most of the singular values of , their required sample complexity can be large.
For leverage score sampling, Xu et al. (2016) showed that to obtain the same convergence bounds as ours, (12) and (13), the sample complexity should be
[TABLE]
which depends on (worse than ours ) but does not depend on coherence. We show that if the ridge leverage score sampling is used, then s=\Theta\big{(}\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\big{)} samples suffices, which is better than the above sample complexity. However, because approximately computing the (ridge) leverage scores is expensive, neither the leverage score sampling of Xu et al. (2016) nor the ridge leverage score sampling proposed by us is a practical choice.
4 Distributed Newton-Type Method
Communication-efficient distributed optimization is an important research field, and second-order methods have been developed to reduce the communication cost, e.g., DANE Shamir et al. (2014), AIDE Reddi et al. (2016), DiSCO Zhang and Lin (2015) and GIANT Shusen Wang et al. (2018). Among them, GIANT has the strongest convergence bound. In this section, we further improve the convergence analysis of GIANT and show that GIANT does converge when the local sample size, , is smaller the number of features, .
4.1 Motivation and algorithm description
Assume the samples are partition among worker machines uniformly at random. Each worker machine has its own processors and memory, and the worker machines can communicate by message passing. The communication are costly compared to the local computation; when the number of worker machines is large, the communication is oftentimes the bottleneck of distributed computing. Thus there is a strong desire to reduce the communication cost of distributed computing. Our goal is to solve the optimization problem (1) in a communication-efficient way.
The first-order methods are computation-efficient but not communication-efficient. Let us take the gradient descent for example. In each iteration, with the iteration at hand, the -th worker machine uses its local data to compute a local gradient ; Then the driver machine averages the local gradient to form the exact gradient and update the model by
[TABLE]
where is the step size. Although each iteration is computationally efficient, the first-order methods (even with acceleration) take many iterations to converge, especially when the condition number is big. As each iteration requires broadcasting and an aggregation of the local gradients to form , the total number and complexity of communication are big.
Many second-order methods have been developed to improve the communication-efficiency, among which the Globally Improved Approximate NewTon (GIANT) method Shusen Wang et al. (2018) has the strongest convergence rates. Let be the local sample size and be the block of , which is previously defined in (9), formed by the -th worker machine. With the iteration at hand, the -th worker machine can use its local data samples to form the local Hessian matrix
[TABLE]
and outputs the local Approximate NewTon (ANT) direction
[TABLE]
Finally, the driver machine averages the ANT direction
[TABLE]
and perform the update
[TABLE]
where the step size can be set to one under certain conditions; we only consider the case in the rest of this section.
GIANT is much more communication-efficient than the first-order methods. With fixed, each iteration of GIANT has four rounds of communications: (1) broadcasting , (2) aggregating the local gradients to form , (3) broadcasting , and (4) aggregating the ANT directions to form ; thus the per-iteration communication cost is just twice as much as a first-order method. Shusen Wang et al. (2018) showed that GIANT requires a much smaller number of iterations than the accelerated gradient method which has the optimal iteration complexity (without using second-order information).
4.2 Our improved convergence bounds
We analyze the GIANT method and improve the convergence analysis of Shusen Wang et al. (2018), which was the strongest theory in terms of communication efficiency. Throughout this section, we assume the samples are partitioned to worker machine uniformly at random.
Improved bound for quadratic loss.
We let be the unique optimal solution to the problem 1 and \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{\star}. If the loss function of (1) is quadratic, e.g., , the Hessian matrix does not change with the iteration, so we use and instead. Theorem 4 guarantees the global convergence of GIANT.
Theorem 4** (Global Convergence).**
*Let respectively be the -ridge leverage score and -coherence of , and be the condition number of . Let and be any user-specified constants. Assume the loss function of (1) is quadratic. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
Proof.
We prove the theorem in Appendix C.2. ∎
Improved bound for non-quadratic loss.
If the loss function of (1) is non-quadratic, we can only guarantee fast local convergence under Assumption 1, as well as the prior works Shusen Wang et al. (2018).
Theorem 5** (Local Convergence).**
*Let respectively be the -ridge leverage score and -coherence of . Let and be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
where is the condition number.
Proof.
We prove the theorem in Appendix C.3 ∎
Remark 2**.**
GIANT is a variant of SSN: SSN uses one of as the descending direction, whereas GIANT uses the averages of the directions. As a benefit of the averaging, the sample complexity is improved from s=\tilde{\Theta}\big{(}\tfrac{{d_{\text{eff}}^{\gamma}}}{\epsilon^{2}}\big{)} to s=\tilde{\Theta}\big{(}\tfrac{{d_{\text{eff}}^{\gamma}}}{\epsilon}\big{)}.
4.3 Comparison with prior work
To guarantee the same convergence bounds, (15) and (16), Shusen Wang et al. require a sample complexity of .222The sample complexity in Shusen Wang et al. (2018) is actually slightly worse; but it is almost trivial to improve their result to what we showed here. This requires require the local sample size be greater than , even if the coherence is small. As communication and synchronization costs grow with , the communication-efficient method, GIANT, is most useful for the large setting; in this case, the requirement is unlikely satisfied.
In contrast, our improved bounds do not require . As can be tremendously smaller than , our requirement can be satisfied even if and are both large. Our bounds match the empirical observation of Shusen Wang et al. (2018): GIANT convergences rapidly even if is larger than .
5 Sub-Sampled Proximal Newton (SSPN)
In the previous sections, we analyze second-order methods for the optimization problem (1) which has a smooth objective function. In this section, we study a harder problem:
[TABLE]
where is a non-smooth function. The standard Newton’s method does not apply because the second derivative of the objective function does not exist. Proximal Newton Lee et al. (2014), a second-order method, was developed to solve the problem, and later on, sub-sampling was incorporated to speed up computation Liu et al. (2017). We further improve the bounds of Sub-Sampled Proximal Newton (SSPN).
5.1 Algorithm Description
Let be the smooth part of the objective function, and and be its first and second derivatives at . The proximal Newton method Lee et al. (2014) iterative solves the problem:
[TABLE]
and then perform the update . The righthand side of the problem is a local quadratic approximation to at . If , then proximal Newton is the same as the standard Newton’s method.
The sub-sampled proximal Newton (SSPN) method uses sub-sampling to approximate ; let the approximate Hessian matrix be , as previously defined in (4). SSPN computes the ascending direction by solving the local quadratic approximation:
[TABLE]
and then perform the update .
5.2 Our improved error convergence bounds
We show that SSPN has exactly the same iteration complexity as SSN, for either quadratic or non-quadratic function . Nevertheless, the overall time complexity of SSPN is higher than SSN, as the subproblem (17) is expensive to solve if is non-smooth.
Theorem 6**.**
Theorems 1, 2, and 3 hold for SSPN.
Proof.
We prove the theorem in Appendix D.3 and D.4. ∎
5.3 Comparison with prior work
Liu et al. (2017) showed that when \|\mbox{\boldmath\Delta\unboldmath}_{t}\|_{2} is small enough, \|\mbox{\boldmath\Delta\unboldmath}_{t+1}\|_{2} will converge to zero linear-quadratically, similar to our results. But their sample complexity is
[TABLE]
This requires the sample size to be greater than . The regularization is often used for high-dimensional data, the requirement that is too restrictive.
Our improved bounds show that suffices for uniform sampling and that suffices for ridge leverage score sampling. Since can be tremendously smaller than when , our bounds are useful for high-dimensional data.
6 Inexactly Solving the Subproblems
Each iteration of SSN (Section 3) and GIANT (Section 4) involves solving a subproblem in the form
[TABLE]
Exactly solving this problem would perform the multiplication and decompose the approximate Hessian matrix ; the time complexity is . In practice, it can be approximately solved by the conjugate gradient (CG) method, each iteration of which applies a vector to and ; the time complexity is , where is the number of CG iterations and is the number of nonzeros. The inexact solution is particularly appealing if the data are sparse. In the following, we analyze the effect of the inexact solution of the subproblem.
Let be the condition number of . For smooth problems, Shusen Wang et al. (2018) showed that by performing
[TABLE]
CG iterations, the conditions (18) and (19) are satisfied, and the inexact solution does not much affect the convergence of SSN and GIANT.
Corollary 7** (SSN).**
*Let and be respectively the exact and an inexact solution to the quadratic problem . SSN updates by . If the condition *
[TABLE]
is satisfied for some , then Theorems 1 and 2, with in (12) and (13) replaced by , continue to hold.
Proof.
We prove the corollary in Appendix E.1. ∎
Corollary 8** (GIANT).**
*Let and be respectively the exact and an inexact solution to the quadratic problem . GIANT updates by . If the condition *
[TABLE]
is satisfied for some and all , then Theorems 4 and 5, with in (15) and (16) replaced by , continue to hold.
Proof.
The corollary can be proved in almost the same way as Shusen Wang et al. (2018). So we do not repeat the proof. ∎
SSPN is designed for problems with non-smooth regularization, in which case finding the exact solution may be infeasible, and the sub-problem can only be inexactly solved. If the inexact satisfies the same condition (18), Corollary 9 will guarantee the convergence rate of SSPN.
Corollary 9** (SSPN).**
Let and be respectively the exact and an inexact solution to the non-smooth problem (17). SSPN updates by . If satisfies the condition (18) for any , then Theorems 6 still holds for SSPN with replaced by .
Proof.
We prove the corollary in Appendix E.2. ∎
7 Conclusion
We studied the subsampled Newton (SSN) method and its variants, GIANT and SSPN, and established stronger convergence guarantees than the prior works. In particular, we showed that a sample size of suffices, where is the regularization parameter and is the effective dimension. When is larger than most of the eigenvalues of the Hessian matrices, is much smaller than the dimension of data, . Therefore, our work guarantees the convergence of SSN, GIANT, and SSPN on high-dimensional data where is comparable to or even greater than . In contrast, all the prior works required a conservative sample size to attain the same convergence rate as ours. Because subsampling means that is much smaller than , the prior works did not lend any guarantee to SSN on high-dimensional data.
Appendix A Random Sampling for Matrix Approximation
Here, we give a short introduction to random sampling and their theoretical properties. Given a matrix , row selection constructs a smaller size matrix () as an approximation of . The rows of is constructed using a randomly sampled and carefully scaled subset of the rows of . Let be the sampling probability associated with the rows of . The rows of is selected independently according to the sampling probability such that for all , we have
[TABLE]
where and are the -th row of and -th row of . In a matrix multiplication form, can be formed as
[TABLE]
where is called the sketching matrix. As a result of row selection, there is only a non-zero entry in each column of , whose position and value correspond to the sampled row of .
Uniform sampling.
Uniform sampling simply sets all the sampling probabilities equal, i.e., . Its corresponding sketching matrix is often called uniform sampling matrix. The non-zero entry in each column of is the same, i.e., . If is sufficiently large,
[TABLE]
is a good approximation to .
Lemma 10** (Uniform Sampling).**
*Let and be defined as that in (10) and (20). Denote for simplicity. Given arbitrary error tolerance and failure probability , when *
[TABLE]
*the spectral approximation holds with probability at least : *
[TABLE]
Proof.
The proof trivially follows from Cohen et al. (2015). ∎
Ridge leverage score sampling.
It takes proportional to the -th ridge leverage score, i.e.,
[TABLE]
where is the ridge leverage score of the -th row of . Let be its sketching matrix. Then the non-zero entry in -th column of is if the -th row of is drawn from the -th row of , where is defined as (21). If the ridge leverage score sampling is used to approximate the Hessian matrix, the approximate Hessian matrix turns to
[TABLE]
The sample complexity in Theorems 1 and 2 will be improved to s=\Theta\big{(}\tfrac{{d_{\text{eff}}^{\gamma}}}{\varepsilon^{2}}\log\tfrac{{d_{\text{eff}}^{\gamma}}}{\delta}\big{)}.
Lemma 11** (Ridge Leverage Rampling).**
Let and be defined as that in (10) and (22). Denote for simplicity. Given arbitrary error tolerance and failure probability , when
[TABLE]
the spectral approximation holds with probability at least :
[TABLE]
Proof.
The proof trivially follows from Cohen et al. (2015). ∎
Appendix B Convergence of Sub-Sampled Newton
In this section, we first give a framework of analyzing the recursion of \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}, which also inspires the proofs for distributed Newton-type Method and SSPN. Within this simple framework, we then complete the proofs for the global and local convergence for SSN.
B.1 A analyzing framework
Approximate Newton Direction.
We can view the process of solving the newton direction from the linear system
[TABLE]
as a convex optimization. Recalling that is defined in (9), we define the quadratic auxiliary function
[TABLE]
Obviously, the true Newton direction is the critical point of :
[TABLE]
Since we use subsampled Hessian , we solve the approximate Newton direction from (11) instead of (23), thus the counterpart of is defined
[TABLE]
It is easy to verify the approximate Newton direction is the minimizers of (26), i.e.,
[TABLE]
Lemma 12 shows that is close to in terms of the value of , if the subsampled Hessian , which is used to establish the linear system satisfies, is a good spectral approximation of .
Lemma 12** (Approximate Newton Direction).**
*Assume holds already. Let , and be defined respectively in (24), (25) and (27). It holds that *
[TABLE]
where .
Proof.
Since we now analyze the approximate Newton direction locally, we leave out all subscript for simplicity. By the assumption that , we conclude that there must exist a symmetric matrix such that
[TABLE]
By the definition of and , we have
[TABLE]
where the second equation is the result of for nonsingular matrixs and . The last equation holds since .
It follows that
[TABLE]
where the third inequality follows from \big{\|}\mbox{\boldmath\Gamma\unboldmath}\big{\|}\leq\frac{\varepsilon}{1-\varepsilon} and the last inequality holds due to \big{\|}\mbox{\boldmath\Omega\unboldmath}\big{\|}\leq\varepsilon.
Thus it follows from \phi({\bf p}^{*})=-\big{\|}{\bf H}^{\frac{1}{2}}{\bf p}^{*}\big{\|}_{2}^{2} and the definition of that
[TABLE]
Combining (LABEL:eq:snn_lem_upperbound2) and (28), we have that
[TABLE]
where . Then the lemme follows. ∎
Approximate Newton Step.
If is very close to (in terms of the value of the auxiliary function ), then the direction , along which the parameter will descend, can be considered provably as a good along direction. Provided that is a good descending direction, Lemma 13 establishes the recursion of \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} after one step of direction descend.
Lemma 13** (Approximate Newton Step).**
*Let Assumption (1) (i.e., the Hessian matrix is -Lipschitz) hold. Let be any fixed error tolerance. If satisfies *
[TABLE]
*Then \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} satisfies *
[TABLE]
Proof.
See the proof of Lemma 9 in Shusen Wang et al. (2018). ∎
Error Recursion.
By combining all the lemmas above, we can analyze the recursion of \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} for SSN.
Let and respectively be the -ridge leverage score and -coherence of . From Lemma 10, when , is a spectral approximation of . By Lemma 12, the approximate Newton direction , solved from the linear system , is not far from in terms of the value of with . It then follows from Lemma 13 that
[TABLE]
which establishes the recursion of \mbox{\boldmath\Delta\unboldmath}_{t} for SSN.
B.2 Proof of SNN for quadratic loss
Proof of Theorem 1.
Since the loss is quadratic, w.l.o.g, let and . Let and respectively be the -ridge leverage score and -coherence of , and be the condition number of . Note that due to the quadratic loss. Let . Since , then .
From the last part of the analysis in B.1, \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} satisfies the error recursion inequality (30) with , i.e.,
[TABLE]
By recursion, it follows that
[TABLE]
Then the theorem follows. ∎
B.3 Proof of SNN for non-quadratic loss
Proof of Theorem 2.
Let , and . Since , then .
From the last part of the analysis in B.1, \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} satisfies the error recursion inequality (30), i.e.,
[TABLE]
Let is the condition number. By plugging and into (30), it follows that
[TABLE]
Vewing (31) as a one-variable quadratic inequality about \big{\|}\mbox{\boldmath\Delta\unboldmath}_{t+1}\big{\|}_{2} and solving it, we have
[TABLE]
where the second inequality follows from for and the last inequality holds by reorganizations and the fact . Then the theorem follows. ∎
B.4 Proof of Theorem 3
Theorem 3 can be proved in the same way as Theorems 1 and 2; the only difference is using Lemma 11 instead of Lemma 10
Appendix C Convergence of GIANT
We still use the framework described in Appendix B.1 to prove the results for GIANT. But two modification should be made in the proof. The first one lies in the part of analyzing Uniform Sampling, since data are distributed and only accessible locally and subsampled Hessians are constructed locally. We can prove each worker machine can simultaneously obtain a spectral approximation of the Hessian matrix. The second one lies in the part of analyzing Approximate Newton Direction, since GIANT uses the global Newton direction, which is the average of all local ANT directions, to update parameters. We can prove the global Newton direction is still a good descending direction. Once above two modifications are solid established, we prove the main theorems for GIANT.
C.1 Two modifications
Simultaneous Uniform Sampling.
We can assume these samples in each worker machine are randomly draw from . This assumption is reasonable because if the samples are i.i.d. drawn from some distribution, then a data-independent partition can be viewed as uniformly sampling equivalently.
Recall that contains the rows of selected by -th work machine in iteration . Let be the associated uniform sampling matrix with each column only one non-zero number . Then . The -th local subsampled Hessian matrix is formed as
[TABLE]
Lemma 14** (Simultaneous Uniform Sampling).**
*Let be fixed parameters. Let be defined in (10) and (32). Denote for simplicity. Then when *
[TABLE]
*with probability at least , the spectral approximation holds simultaneously, i.e., *
[TABLE]
Proof.
Since we analyze each local , we leave out the subscribe for simplicity.
By Lemma 10, we know with probability , when , it follows that ,
[TABLE]
By Bonferroni’s method, we know with probability , the spectral approximation holds simultaneously. ∎
Global Approximate Newton Direction
Recall that the gradient at iteration is . The local ANT computed by -th worker machine is . The global Newton direction is formed as
[TABLE]
where is defined as the harmonic mean of , i.e.,
[TABLE]
Lemma 15** (Model average).**
Assume condition (33) holds for given . Let and be defined respectively in (24) and (34). It holds that
[TABLE]
where .
Proof.
We leave out all subscript for simplicity. It follows from condition (33) that there must exist a symmetric matrix \mbox{\boldmath\Gamma\unboldmath}_{i} such that
[TABLE]
By the definition of and , we have
[TABLE]
where the second equation is the result of for nonsingular matrixs and and the last equation holds since .
It follows that
[TABLE]
It follows from the assumption (33) that
[TABLE]
Let be the concatenation of . Then is a uniform sampling matrix which samples rows. Actually, is a permutation matrix, with every row and column containing precisely a single 1 with 0s everywhere else. Therefore,
[TABLE]
It follows from (35), (36) and (37) that
[TABLE]
By the definition of and (38), it follows that
[TABLE]
where . Then the lemme follows from \phi({\bf p}^{*})=-\big{\|}{\bf H}^{\frac{1}{2}}{\bf p}^{*}\big{\|}_{2}^{2}. ∎
Error Recursion.
Plugging above two modifications into the analysis framework described in Appendix B.1, we can analyze the recursion of \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} for GIANT.
Let and respectively be the -ridge leverage score and -coherence of . From Lemma 14, when , for each , is a spectral approximation of . By Lemma 15, the global ANT , an average of all local APTs , is not far from in terms of the value of with . It then follows from Lemma 13 that (30) still holds but with , i.e.,
[TABLE]
which establishes the recursion of \mbox{\boldmath\Delta\unboldmath}_{t} for GIANT.
C.2 Proof of GIANT for quadratic loss
Proof of Theorem 4.
Since the loss is quadratic, w.l.o.g, let and . Let and respectively be the -ridge leverage score and -coherence of , and be the condition number of . Note that due to the quadratic loss. Let . Since , then .
From the last part of the analysis in C.1, \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} satisfies the error recursion inequality (30) with , i.e.,
[TABLE]
By recursion, it follows that
[TABLE]
Then the theorem follows. ∎
C.3 Proof of GIANT for non-quadratic loss
Proof of Theorem 5.
Let , and . Since , then .
From the last part of the analysis in C.1, \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} satisfies the error recursion inequality (30), i.e.,
[TABLE]
Let is the condition number. By plugging and into (30), we can obtain a one-variable quadratic inequality about \big{\|}\mbox{\boldmath\Delta\unboldmath}_{t+1}\big{\|}_{2}, which is almost the same form as (31) except the value of . Solving it, we have
[TABLE]
Then the theorem follows from . ∎
Appendix D Convergence of SSPN
Since the proximal mapping of the non-smooth part is used, rather than direct gradient descend, the analysis of Approximate Newton Step should be modified. We first introduce two properties of proximal mapping, and then provides the error recursion of \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} for SSPN. Based on that error recursion, we can prove the global convergence for quadratic loss and the local convergence for non-quadratic loss for SSPN.
D.1 Proximal mapping
The definition of the proximal operater is merely for theoretical analysis. So we move it to the appendix. The proximal mapping is defined as
[TABLE]
which involve the current point , the convex (perhaps non-smooth) function , and the SPSD precondition matrix . The update rule of SSPN is:
[TABLE]
SSN can also be written in this form, with .
The proximal mapping enjoys the nonexpansiveness property and fixed point property Lee et al. (2014).
Lemma 16** (Nonexpansiveness).**
*Let be a convex function and be a SPSD matrix. The scaled proximal mapping is nonexpansive, i.e., for all and and , *
[TABLE]
Lemma 17** (Fixed Point Property of Minimizers).**
*Let be convex and twice differential and be convex. Let be the gradient of at , i.e., . Then for any SPSD matrix , minimizes if and only if *
[TABLE]
D.2 Analysis of Approximate Newton Step
Lemma 18** (Error Recursion).**
*Assume and Assumption 1 (i.e., the Hessian Lipschitz continuity) hold. Let \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}, we have *
[TABLE]
Proof.
Recall that the updating rule is
[TABLE]
Let be the minimizer and be the its gradient of the smooth part of the objective (i.e., F({\bf w})=\frac{1}{n}\sum_{j=1}^{n}l_{j}({\bf x}_{j}^{T}{\bf w})+\frac{\gamma}{2}\big{\|}{\bf w}\big{\|}_{2}^{2}). It follows that
[TABLE]
where the first equality results from Lemma 17 and the first inequality results from Lemma 16.
Let \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*} and {\bf Z}_{t}={\bf H}_{t}\mbox{\boldmath\Delta\unboldmath}_{t}-({\bf g}_{t}-{\bf g}^{*}) for short. Then
[TABLE]
The first inequality is a rephrasing of (40). The second inequality follows from the triangle inequality and the definition of . The third inequality holds due to and .
Next we upper bound \big{\|}{\bf Z}_{t}\big{\|}_{2}. It follows that
[TABLE]
The second line holds due to
[TABLE]
The third line follows from Cauchy inequality and the definition . The last inequality holds due to the Hessian Lipschitz continuity (Assumption 1).
Note that \big{\|}\mbox{\boldmath\Delta\unboldmath}_{t+1}\big{\|}_{{\bf H}_{t}}\leq\frac{1}{\sqrt{1-\varepsilon}}\big{\|}\mbox{\boldmath\Delta\unboldmath}_{t+1}\big{\|}_{\widetilde{\bf H}_{t}}. Thus the lemma follows from this equality, (41) and (42), i.e.,
[TABLE]
∎
D.3 Proof of SSPN for quadratic loss
Theorem 19** (Formal statement of Theorem 6 for quadratic loss).**
*Let and respectively be the -ridge leverage score and -coherence of , and be the condition number of . Let and be any user-specified constants. Assume each loss function is quadratic. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
Proof of Theorem 19.
Under the same condition as Theorem 1, by Lemma 18, it follows that
[TABLE]
Since the loss is quadratic then . Let and . Let and respectively be the -ridge leverage score and -coherence of , and be the condition number of . Then we have
[TABLE]
By recursion, it follows that
[TABLE]
where and is the conditional number. ∎
D.4 Proof of SSPN for non-quadratic loss
Theorem 20** (Formal statement of Theorem 6 for non-quadratic loss).**
*Let respectively be the -ridge leverage score and -coherence of . Let and be any user-specified constants. Let Assumption 1 be satisfied. For a sufficiently large sub-sample size: *
[TABLE]
*with probability at least , *
[TABLE]
where is the condition number.
Proof of Theorem 20.
By plugging into the result of Lemma 18, it follows that
[TABLE]
where is the condition number.
Since , is bounded by 2. Thus we have
[TABLE]
which proves this theorem. ∎
Appendix E Inexact Solution to Sub-Problems
The computation complexity can be alleviated when the Conjugate Gradient (CG) method is used to compute the inexact Newton step. This methodology has been discussed and practiced before Shusen Wang et al. (2018). In this section, we prove that SSN and GIANT can benefit from inexact Newton step. What’s more, we provide an theoretical bound for inexact solution for SSPN.
The framework described in Appendix B.1 can help us to prove results for SSN and GIANT. Since CG produces an approximate solution for the linear system which the approximate Newton direction satisfies, the analysis of Approximate Newton Direction in Appendix B.1 should be modified. We can prove that when the inexact solution satisfies the particular stopping condition, it is close to the exact Newton direction in terms of the value of .
E.1 Inexactly solving for SSN
In the -th iteration, the exact solution is , where is the subsampled Hessian defined in (10). Let be the inexact solution produced by CG. It satisfies the stopping condition (18), i.e.,
[TABLE]
Thus SSN takes inexact Newton direction to update the parameter instead of .
Lemma 21** (Inexact solution for SSN).**
For given , assume holds. Let be defined in (24). Let be the inexact solution satisfying (18). Then it holds that
[TABLE]
where .
Proof.
We leave out the subscript for simplicity.
It follows from the stopping condition (18) that
[TABLE]
Since , it follows that
[TABLE]
Then it follows that
[TABLE]
where the last inequality is due to (28) (which results from Lemma 12).
By the definition of and (43), it follows that
[TABLE]
where . ∎
E.2 Inexactly solving for SSPN
When the inexact solution is used, the update rule of SSPN becomes:
[TABLE]
where is the inexact solution. CG produces via inexactly solving the linear system with stopping condition (18), which is equivalent to
[TABLE]
Lemma 22** (Inexact solution for GIANT).**
*Assume and Assumption 1 (i.e., the Hessian Lipschitz continuity) hold. Assume . Let \mbox{\boldmath\Delta\unboldmath}_{t}={\bf w}_{t}-{\bf w}^{*}. Let be an approximation to the SSPN direction and , we have *
[TABLE]
where .
Proof.
By the updating rules and , we obtain
[TABLE]
It follows from (18) that
[TABLE]
and thus
[TABLE]
where .
Since , it follows that
[TABLE]
It follows from the bound on (Lemma 18) that
[TABLE]
Thus we can obtain
[TABLE]
where and are some function of and satisfying
[TABLE]
Since , it follows that
[TABLE]
Then the lemma follows. ∎
Proof of Corollary 9.
Similar to Lemma 18, Lemma 21 show that when the inexact SSPN direction is used, still quadratic-linearly converges. The only difference in their conclusions is that in Lemma 18 now is changed into in Lemma 21. In the proof of Theorem 19 and 20, we can replace Lemma 18 with Lemma 21, then results still hold for inexactly solving for SSPN, except that the value of is changed into .
Since we can always determine what to choose in advance, we can use a spectral approximation of (which will slightly change the value of but will not change its order), thus will become as Corollary 9 states. Therefore we prove Corollary 9. ∎
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alaoui and Mahoney [2015] Ahmed Alaoui and Michael W Mahoney. Fast Randomized Kernel Ridge Regression with Statistical Guarantees. In Advances in Neural Information Processing Systems (NIPS) . 2015.
- 2Berahas et al. [2017] Albert S Berahas, Raghu Bollapragada, and Jorge Nocedal. An investigation of Newton-sketch and subsampled Newton methods. ar Xiv preprint ar Xiv:1705.06211 , 2017.
- 3Bubeck [2014] Sébastien Bubeck. Theory of convex optimization for machine learning. ar Xiv preprint ar Xiv:1405.4980 , 15, 2014.
- 4Byrd et al. [2011] Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic hessian information in optimization methods for machine learning. SIAM Journal on Optimization , 21(3):977–995, 2011.
- 5Candès and Recht [2009] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics , 9(6):717, 2009.
- 6Candes et al. [2006] Emmanuel J Candes, Justin K Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences , 59(8):1207–1223, 2006.
- 7Candès et al. [2011] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM (JACM) , 58(3):11, 2011.
- 8Cohen et al. [2015] Michael B Cohen, Cameron Musco, and Christopher Musco. Ridge leverage scores for low-rank approximation. ar Xiv preprint ar Xiv:1511.07263 , 6, 2015.
