Memory-Efficient LLM Training with Online Subspace Descent
Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu

TL;DR
This paper introduces Online Subspace Descent, a memory-efficient optimizer with convergence guarantees that replaces SVD with online PCA when updating the projection matrix. In LLaMA pretraining, it achieves lower perplexity and better downstream performance than existing low-rank methods.
Contribution
The paper provides the first convergence guarantee for arbitrary update rules of the projection matrix in low-rank gradient methods, and proposes a novel optimizer that uses online PCA for efficient LLM training.
Findings
Online Subspace Descent outperforms existing low-rank methods in LLaMA pretraining.
It achieves lower perplexity and better downstream task performance.
It narrows the gap with full-rank training methods.
Abstract
Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with…
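To make the idea in the abstract concrete, the following is a minimal NumPy sketch of keeping optimizer state in a low-rank subspace. It is not the authors' implementation. The function names, the toy quadratic loss, the hyperparameters, and in particular the Oja-style update in `online_pca_step` are assumptions chosen for illustration, since the abstract above is truncated before it states the actual update rule. `svd_projection` shows the SVD-based baseline only for comparison.

```python
import numpy as np


def svd_projection(grad, rank):
    """SVD baseline: top-`rank` left singular vectors of the gradient."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


def online_pca_step(proj, grad, pca_lr=0.5):
    """One Oja-style subspace update of the projection matrix.

    Illustrative assumption, not the paper's rule: move `proj` along
    (I - P P^T) G G^T P, normalized by ||G||_F^2, then re-orthonormalize.
    """
    residual = grad - proj @ (proj.T @ grad)      # part of G outside span(P)
    direction = residual @ (grad.T @ proj)        # shape (m, rank)
    scale = np.linalg.norm(grad) ** 2 + 1e-12
    q, r = np.linalg.qr(proj + pca_lr * direction / scale)
    signs = np.where(np.diag(r) < 0, -1.0, 1.0)   # keep column signs stable
    return q * signs


def captured_energy(proj, grad):
    """Fraction of the gradient's squared Frobenius norm kept by `proj`."""
    kept = np.linalg.norm(proj.T @ grad) ** 2
    return kept / (np.linalg.norm(grad) ** 2 + 1e-12)


def subspace_momentum_step(weight, grad, proj, state, lr=0.1, beta=0.9):
    """Momentum SGD whose state has shape (rank, n) instead of (m, n)."""
    state = beta * state + (1.0 - beta) * (proj.T @ grad)
    return weight - lr * (proj @ state), state


def run_demo(steps=300, m=32, n=48, rank=4, seed=0):
    """Fit a low-rank target with a toy quadratic loss (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
    weight = np.zeros((m, n))
    proj, _ = np.linalg.qr(rng.standard_normal((m, rank)))  # random orthonormal start
    state = np.zeros((rank, n))

    def loss(w):
        return 0.5 * np.linalg.norm(w - target) ** 2

    initial_loss = loss(weight)
    for _ in range(steps):
        grad = weight - target                  # gradient of the toy loss
        proj = online_pca_step(proj, grad)      # no SVD inside the loop
        weight, state = subspace_momentum_step(weight, grad, proj, state)
    return initial_loss, loss(weight), proj


if __name__ == "__main__":
    start, end, _ = run_demo()
    print(f"loss before: {start:.3f}, loss after: {end:.3f}")
```

The memory saving in this sketch comes from `state` having shape `(rank, n)` rather than `(m, n)`. The projection matrix is refreshed at every step with a cheap matrix product and a thin QR factorization instead of a full SVD of the gradient.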
Taxonomy
Topics: Digital Rights Management and Security
Methods: LLaMA · Principal Components Analysis · Adam
