Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale
Ansh Nagwekar

TL;DR
This paper explores the evolution of neural network optimization algorithms, emphasizing principled design and advanced techniques like second-order methods to improve training efficiency and understanding.
Contribution
It provides a comprehensive analysis of optimization methods from first-order to higher-order techniques, offering practical strategies for modern deep learning training.
Findings
Limitations of SGD in anisotropic data regimes
Advantages of second-order approximation techniques
Integration strategies for advanced optimizers in training workflows
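The first two findings can be illustrated with a toy sketch. The example below is not taken from the thesis: the quadratic loss, the function names, and the step sizes are all assumptions chosen for illustration, and noiseless gradient descent stands in for SGD. On an anisotropic quadratic, the step size of plain gradient descent is capped by the steepest direction, so the flat direction converges slowly. Rescaling the gradient by the inverse Hessian diagonal (a simple stand-in for a second-order approximation, exact here only because the Hessian is diagonal) removes that bottleneck.

```python
import numpy as np

def gradient_descent(H, w0, lr, steps):
    """Plain gradient descent on f(w) = 0.5 * w^T H w; returns the final loss."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * (H @ w)  # the gradient of f is H w
    return 0.5 * w @ H @ w

def diag_preconditioned_descent(H, w0, lr, steps):
    """Same loss, but each gradient coordinate is scaled by 1 / H_ii."""
    w = w0.copy()
    inv_diag = 1.0 / np.diag(H)  # hypothetical diagonal curvature estimate
    for _ in range(steps):
        w = w - lr * inv_diag * (H @ w)
    return 0.5 * w @ H @ w

# Anisotropic curvature: condition number 100 (illustrative values only).
H = np.diag([100.0, 1.0])
w0 = np.array([1.0, 1.0])

# Plain descent needs lr < 2 / 100 for stability, so the flat direction crawls.
loss_gd = gradient_descent(H, w0, lr=0.019, steps=50)

# With the diagonal preconditioner, a unit step is stable in every direction.
loss_pre = diag_preconditioned_descent(H, w0, lr=1.0, steps=1)

print(f"plain GD, 50 steps:     {loss_gd:.3e}")
print(f"preconditioned, 1 step: {loss_pre:.3e}")
```

In realistic networks the Hessian is neither diagonal nor cheap to compute, which is why practical methods rely on approximations. This sketch only shows why curvature information helps when the loss surface is badly conditioned.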
Abstract
Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved interpretability into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in these over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Advanced Multi-Objective Optimization Algorithms
