Gradient descent with generalized Newton's method
Zhiqi Bu, Shiyun Xu

TL;DR
The paper introduces the generalized Newton's method (GeN), a Hessian-informed approach that automatically adjusts the learning rate to speed up convergence without extensive scheduler tuning. It applies to any optimizer, such as SGD and Adam.
Contribution
It presents GeN, a method that applies to existing optimizers and dynamically selects their learning rate, aiming for faster convergence while remaining easy to implement across different models and tasks.
Findings
GeN matches state-of-the-art performance on language and vision tasks.
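GeN needs only additional forward passes, with almost zero overhead in training time and memory when amortized over many iterations, and it avoids intensive tuning of the learning rate scheduler.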
Experiments demonstrate GeN's effectiveness across GPT and ResNet models.
Abstract
We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers.
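The abstract does not spell out how the learning rate is chosen, so the following is only a minimal sketch of one way a Hessian-informed rate could be estimated from extra forward passes. It assumes the loss along the update direction is approximated by a quadratic fitted through three loss values. The function name `suggest_learning_rate`, the `probe` step size, and the toy loss `quad_loss` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def suggest_learning_rate(loss_fn, w, d, probe=0.1):
    """Estimate a step size for the update w - lr * d from three loss values.

    Assumption (not from the source): along the direction d,
    L(w - lr*d) ~ L(w) - lr * (G.d) + 0.5 * lr**2 * (d^T H d),
    where G is the gradient and H the Hessian. Both coefficients are
    recovered by finite differences, so only forward passes are needed.
    """
    def shifted(step):
        return [wi - step * di for wi, di in zip(w, d)]

    loss_0 = loss_fn(w)
    loss_fwd = loss_fn(shifted(probe))    # L(w - probe * d)
    loss_bwd = loss_fn(shifted(-probe))   # L(w + probe * d)

    slope = (loss_bwd - loss_fwd) / (2 * probe)                  # estimates G.d
    curvature = (loss_fwd + loss_bwd - 2 * loss_0) / probe ** 2  # estimates d^T H d

    if curvature <= 0:
        return None  # the fitted parabola has no minimum; keep the previous rate
    return slope / curvature  # minimizer of the fitted parabola


# Toy example: a two-parameter quadratic loss with a hand-written gradient.
def quad_loss(w):
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2


w = [0.0, 0.0]
d = [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]  # plain gradient as the direction
eta = suggest_learning_rate(quad_loss, w, d)
w_new = [wi - eta * di for wi, di in zip(w, d)]
print(eta, quad_loss(w), quad_loss(w_new))
```

In this sketch the direction `d` could equally be the update proposed by SGD or Adam. For a one-dimensional quadratic loss with `d` set to the gradient, the fitted rate equals the inverse second derivative, which is the Newton-Raphson step. The two extra loss evaluations would not have to run at every iteration, which is one way the overhead could be amortized.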
Taxonomy
Topics: Advanced Numerical Analysis Techniques · Iterative Methods for Nonlinear Equations · Reservoir Engineering and Simulation Methods
Methods: Attention Is All You Need · Byte Pair Encoding · Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Linear Layer · Attention Dropout · Dropout · Dense Connections
