A block coordinate descent optimizer for classification problems exploiting convexity
Ravi G. Patel, Nathaniel A. Trask, Mamikon A. Gulian, Eric C. Cyr

TL;DR
This paper introduces a hybrid Newton/Gradient Descent (NGD) method that exploits the convexity of the cross-entropy loss in the weights of a deep network's linear layer, improving training efficiency and accuracy on classification tasks.
Contribution
It presents a block coordinate descent optimizer that alternates a second-order (Newton) method for the linear layer, where the cross-entropy loss is convex, with gradient descent for the hidden layers.
Findings
Improved validation error on classification tasks
Qualitative differences in learned basis functions
Enhanced training accuracy on image benchmarks
Abstract
Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number…
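The alternating scheme described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single tanh hidden layer, the one damped Newton step per outer iteration (rather than a full solve to optimality), the backtracking rule, the damping value, the learning rate, and all function names (`newton_step_linear`, `gd_step_hidden`, `train_ngd`) are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(P, Y):
    # Mean cross-entropy between predicted probabilities P and one-hot labels Y.
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def features(X, V, b):
    # Hidden layer acting as an adaptive basis (assumed here: one tanh layer).
    return np.tanh(X @ V + b)

def newton_step_linear(Phi, Y, W, damping=1e-3):
    """One damped Newton step on the linear-layer weights W, basis Phi held fixed."""
    N, H = Phi.shape
    C = Y.shape[1]
    P = softmax(Phi @ W)
    grad = Phi.T @ (P - Y) / N                                  # shape (H, C)
    # Per-sample softmax curvature: diag(p_i) - p_i p_i^T, shape (N, C, C).
    S = np.einsum('ic,cd->icd', P, np.eye(C)) - np.einsum('ic,id->icd', P, P)
    hess = np.einsum('ih,ij,icd->hcjd', Phi, Phi, S) / N        # (H, C, H, C)
    # The softmax Hessian is only positive semidefinite, so damping is added
    # (an assumption of this sketch) to make the linear solve well posed.
    hess = hess.reshape(H * C, H * C) + damping * np.eye(H * C)
    delta = np.linalg.solve(hess, grad.reshape(-1)).reshape(H, C)
    loss0 = cross_entropy(P, Y)
    step = 1.0
    for _ in range(20):                                         # simple backtracking
        W_new = W - step * delta
        if cross_entropy(softmax(Phi @ W_new), Y) <= loss0:
            return W_new
        step *= 0.5
    return W                                                    # no acceptable step found

def gd_step_hidden(X, Y, V, b, W, lr=0.1):
    """One gradient descent step on the hidden-layer parameters, W held fixed."""
    N = X.shape[0]
    Phi = features(X, V, b)
    P = softmax(Phi @ W)
    dPhi = (P - Y) @ W.T / N            # gradient w.r.t. the basis outputs
    dA = dPhi * (1.0 - Phi ** 2)        # back through tanh
    return V - lr * (X.T @ dA), b - lr * dA.sum(axis=0)

def train_ngd(X, Y, hidden=8, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b = np.zeros(hidden)
    W = np.zeros((hidden, Y.shape[1]))
    history = []
    for _ in range(iters):
        Phi = features(X, V, b)
        W = newton_step_linear(Phi, Y, W)       # second-order block: linear layer
        history.append(cross_entropy(softmax(Phi @ W), Y))
        V, b = gd_step_hidden(X, Y, V, b, W)    # first-order block: hidden layer
    return V, b, W, history
```

Calling `train_ngd(X, Y)` with an `(N, d)` feature array and an `(N, C)` one-hot label array returns the trained parameters and the loss recorded after each Newton step. In this sketch the Hessian is formed only over the linear-layer weights, so the second-order solve never touches the hidden-layer parameters.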
Taxonomy
Methods: Linear Layer
