Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study

Peng Xu; Farbod Roosta-Khorasani; Michael W. Mahoney

arXiv:1708.07827 · math.OC · February 19, 2018 · 35 cites

TL;DR

This paper empirically evaluates second-order Newton-type methods for non-convex machine learning, showing they are competitive with SGD, more robust to hyper-parameters, and better at escaping saddle points and flat regions.

Contribution

The paper provides a detailed empirical analysis demonstrating the effectiveness and robustness of sub-sampled trust region (TR) and adaptive regularization with cubics (ARC) methods on non-convex ML tasks.

Findings

1. Newton-type methods match or outperform SGD in generalization.
2. These methods are highly robust to hyper-parameter variations.
3. They effectively escape flat regions and saddle points.

Abstract

While first-order optimization methods such as stochastic gradient descent (SGD) are popular in machine learning (ML), they come with well-known deficiencies, including relatively-slow convergence, sensitivity to the settings of hyper-parameters such as learning rate, stagnation at high training errors, and difficulty in escaping flat regions and saddle points. These issues are particularly acute in highly non-convex settings such as those arising in neural networks. Motivated by this, there has been recent interest in second-order methods that aim to alleviate these shortcomings by capturing curvature information. In this paper, we report detailed empirical evaluations of a class of Newton-type methods, namely sub-sampled variants of trust region (TR) and adaptive regularization with cubics (ARC) algorithms, for non-convex ML problems. In doing so, we demonstrate that these methods not…
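
To make the idea of a sub-sampled trust region method concrete, the following is a minimal sketch and not the authors' implementation. It assumes a sigmoid least-squares loss, a full gradient combined with a Hessian estimated from a random subset of the data, and a simple Cauchy-point step for the trust region subproblem. All function names and parameters (`subsampled_tr`, `sample_size`, `eta`, the radius update factors) are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    # Non-convex objective (assumed for illustration): mean squared sigmoid residual.
    return np.mean((sigmoid(X @ w) - y) ** 2)

def grad(w, X, y):
    s = sigmoid(X @ w)
    return X.T @ (2.0 * (s - y) * s * (1.0 - s)) / len(y)

def subsampled_hessian(w, X, y, idx):
    # Hessian of the loss restricted to the rows in idx (may be indefinite).
    Xs, ys = X[idx], y[idx]
    s = sigmoid(Xs @ w)
    ds = s * (1.0 - s)
    c = 2.0 * ds ** 2 + 2.0 * (s - ys) * ds * (1.0 - 2.0 * s)
    return (Xs.T * c) @ Xs / len(idx)

def cauchy_step(g, H, radius):
    # Minimizer of the quadratic model along -g, restricted to the trust region.
    gnorm = np.linalg.norm(g)
    gHg = g @ H @ g
    tau = 1.0 if gHg <= 0 else min(1.0, gnorm ** 3 / (radius * gHg))
    return -(tau * radius / gnorm) * g

def subsampled_tr(w, X, y, sample_size, iters=50, radius=1.0, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        g = grad(w, X, y)
        if np.linalg.norm(g) < 1e-8:
            break
        idx = rng.choice(len(y), size=sample_size, replace=False)
        H = subsampled_hessian(w, X, y, idx)
        s = cauchy_step(g, H, radius)
        predicted = -(g @ s + 0.5 * s @ H @ s)   # model decrease, positive for a Cauchy step
        actual = loss(w, X, y) - loss(w + s, X, y)
        if actual / predicted >= eta:
            w = w + s          # accept the step and enlarge the region
            radius *= 2.0
        else:
            radius *= 0.5      # reject the step and shrink the region
    return w

# Small synthetic problem (made-up data, for demonstration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)
w0 = np.zeros(5)
w_final = subsampled_tr(w0, X, y, sample_size=40)
print(loss(w0, X, y), loss(w_final, X, y))
```

The Cauchy point is the simplest step that still uses curvature; a practical solver would typically approximate the subproblem more accurately. An ARC-style variant would replace the hard radius constraint with a cubic penalty on the step norm.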

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Stochastic Gradient Descent