Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW
Di Zhang, Yihang Zhang

TL;DR
This paper introduces a stochastic conjugate subgradient method with adaptive sampling and AdamW-like step-size adjustments for training large language models, reporting faster convergence per iteration and better scalability than traditional SGD.
Contribution
It presents a novel optimization algorithm combining stochastic conjugate subgradients, adaptive sampling, and AdamW-like step size adjustments tailored for LLMs.
Findings
Faster convergence per iteration compared to SGD.
Improved scalability in large-scale LLM training.
Enhanced speed and accuracy in optimization.
Abstract
Stochastic gradient descent (SGD) methods have long been central to training large language models (LLMs). However, their effectiveness is increasingly being questioned, particularly in large-scale applications where empirical evidence suggests potential performance limitations. In response, this paper proposes a stochastic conjugate subgradient method together with adaptive sampling tailored specifically for training LLMs. The method not only achieves faster convergence per iteration but also demonstrates improved scalability compared to traditional SGD techniques. It leverages sample complexity analysis to adaptively choose the sample size, employs a stochastic conjugate subgradient approach to determine search directions, and uses an AdamW-like algorithm to adaptively adjust step sizes. This approach preserves the key advantages of first-order methods while effectively addressing…
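The three ingredients named in the abstract (adaptive sample size, a conjugate search direction, AdamW-like step sizes) can be illustrated on a toy least-squares problem. The sketch below is not the authors' algorithm, since the abstract does not spell out its rules. It rests on stated assumptions: a variance-based "norm test" stands in for the paper's sample-complexity rule, a clipped Polak-Ribière formula with restarts stands in for the conjugate subgradient direction, and the names `conjugate_direction`, `train`, `half_mse`, and all hyperparameter values are hypothetical.

```python
import numpy as np


def conjugate_direction(g, g_prev, d_prev, beta_max=0.5):
    """Clipped Polak-Ribiere direction with restart (assumed stand-in rule)."""
    if g_prev is None:
        return -g
    denom = float(g_prev @ g_prev)
    if denom == 0.0:
        return -g
    beta = float(np.clip(g @ (g - g_prev) / denom, 0.0, beta_max))
    d = -g + beta * d_prev
    # Restart with the plain negative gradient if d is not a descent direction.
    if d @ g >= 0.0:
        return -g
    return d


def half_mse(X, y, w):
    """Half mean squared error of the linear model X @ w."""
    r = X @ w - y
    return 0.5 * float(np.mean(r * r))


def train(X, y, steps=300, batch=8, max_batch=256, lr=0.05,
          beta2=0.999, eps=1e-8, weight_decay=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    v = np.zeros(p)            # running second moment of the direction
    g_prev, d = None, np.zeros(p)
    for t in range(1, steps + 1):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        residual = X[idx] @ w - y[idx]
        per_sample = X[idx] * residual[:, None]   # per-sample gradients
        g = per_sample.mean(axis=0)

        # Adaptive sampling (assumed norm test): double the batch when the
        # estimated variance of the mean gradient exceeds its squared norm.
        var_of_mean = per_sample.var(axis=0).sum() / len(idx)
        if var_of_mean > g @ g and batch < max_batch:
            batch = min(2 * batch, max_batch)

        d = conjugate_direction(g, g_prev, d)
        g_prev = g

        # AdamW-like step: per-coordinate scaling plus decoupled weight decay.
        v = beta2 * v + (1.0 - beta2) * d * d
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * weight_decay * w
        w = w + lr * d / (np.sqrt(v_hat) + eps)
    return w, batch
```

Calling `train(X, y)` on a feature matrix `X` and target vector `y` returns the fitted weights and the final batch size. The weight decay is applied separately from the scaled step, which is the decoupling that distinguishes AdamW from Adam with an L2 penalty.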
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
