Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW

Di Zhang; Yihang Zhang

arXiv:2507.01241 · cs.LG · July 3, 2025

TL;DR

This paper introduces a stochastic conjugate subgradient method with adaptive sampling and AdamW-like adjustments for training large language models, achieving faster convergence and better scalability than traditional SGD.

Contribution

It presents a novel optimization algorithm combining stochastic conjugate subgradients, adaptive sampling, and AdamW-like step size adjustments tailored for LLMs.

Findings

1. Faster convergence per iteration compared to SGD.
2. Improved scalability in large-scale LLM training.
3. Enhanced speed and accuracy in optimization.

Abstract

Stochastic gradient descent (SGD) methods have long been central to training large language models (LLMs). However, their effectiveness is increasingly being questioned, particularly in large-scale applications where empirical evidence suggests potential performance limitations. In response, this paper proposes a stochastic conjugate subgradient method, together with adaptive sampling, tailored specifically for training LLMs. The method not only achieves faster convergence per iteration but also demonstrates improved scalability compared to traditional SGD techniques. It leverages sample complexity analysis to adaptively choose the sample size, employs a stochastic conjugate subgradient approach to determine search directions, and uses an AdamW-like algorithm to adaptively adjust step sizes. This approach preserves the key advantages of first-order methods while effectively addressing…
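
The abstract names three ingredients: an adaptively chosen sample size, a conjugate subgradient search direction, and an AdamW-like step size. The following is a minimal sketch of how such pieces might fit together on a toy least-absolute-deviations regression. It is not the authors' algorithm. The function names (`lad_loss`, `min_norm_combination`, `train`), the batch-doubling noise test standing in for the paper's sample complexity analysis, the minimum-norm combination used as the direction rule, and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def lad_loss(w, X, y):
    """Mean absolute error of the linear model X @ w (a nonsmooth toy objective)."""
    return float(np.mean(np.abs(X @ w - y)))


def min_norm_combination(prev_dir, grad):
    """Minimum-norm point on the segment between prev_dir and grad.

    Assumed direction rule for this sketch: blend the previous direction
    with the new subgradient so that the combined vector is as short as possible.
    """
    diff = prev_dir - grad
    denom = float(diff @ diff)
    if denom == 0.0:
        return grad.copy()
    # Minimize ||grad + lam * diff||^2 over lam in [0, 1].
    lam = float(np.clip(-(grad @ diff) / denom, 0.0, 1.0))
    return grad + lam * diff


def train(X, y, steps=300, batch=32, max_batch=256, lr=0.05,
          beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-2, seed=0):
    """Toy loop: adaptive batch size, blended direction, AdamW-style update."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    w = np.zeros(dim)
    m = np.zeros(dim)  # first-moment estimate
    v = np.zeros(dim)  # second-moment estimate
    direction = None
    for t in range(1, steps + 1):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        residual = X[idx] @ w - y[idx]
        per_sample = np.sign(residual)[:, None] * X[idx]  # per-sample subgradients
        g = per_sample.mean(axis=0)

        # Assumed heuristic (not the paper's analysis): double the batch
        # when the estimated sampling noise exceeds the squared subgradient norm.
        noise = per_sample.var(axis=0).sum() / len(idx)
        if noise > float(g @ g) and batch < max_batch:
            batch = min(2 * batch, max_batch)

        direction = g if direction is None else min_norm_combination(direction, g)

        # AdamW-style step: bias-corrected moments plus decoupled weight decay.
        m = beta1 * m + (1 - beta1) * direction
        v = beta2 * v + (1 - beta2) * direction ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w = w * (1 - lr * weight_decay)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, batch


# Usage on synthetic data (hypothetical problem sizes).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=500)
w, final_batch = train(X, y)
print(lad_loss(np.zeros(5), X, y), lad_loss(w, X, y), final_batch)
```

The absolute-error objective is used only because it is nonsmooth, so the per-sample vectors are subgradients rather than gradients. Replacing `min_norm_combination` with the plain subgradient `g` reduces the loop to a mini-batch AdamW-style baseline, which makes the effect of the direction rule easy to compare on the same data.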

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis