A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models
Yiming Chen, Yuan Zhang, Yin Liu, Kun Yuan, Zaiwen Wen

TL;DR
This paper presents a novel randomized subspace optimization method that significantly reduces memory usage for training large language models, addressing both optimizer states and activations, with proven convergence guarantees and competitive performance.
Contribution
It introduces a new framework that decomposes high-dimensional training into lower-dimensional subproblems, improving memory efficiency and providing theoretical convergence analysis.
Findings
Reduces memory footprint for activations and optimizer states.
Achieves comparable performance to existing methods like GaLore and Adam.
Demonstrates superior memory and communication efficiency in experiments.
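To give a rough sense of the optimizer-state saving, the sketch below counts the floats Adam keeps for one weight matrix. Adam stores a first and a second moment per trainable entry. The subspace layout is an assumption made for illustration, not taken from the paper: Adam is run on an m x r subspace variable instead of the full m x n matrix. The function names and the sizes m = n = 4096, r = 128 are hypothetical, and storage for the projection matrix and for activations is not counted.

```python
def adam_state_floats(m: int, n: int) -> int:
    """Floats Adam keeps for a full m x n weight: first and second moments."""
    return 2 * m * n


def subspace_adam_state_floats(m: int, r: int) -> int:
    """Floats Adam keeps when it only tracks an m x r subspace variable (assumed layout)."""
    return 2 * m * r


# Hypothetical sizes, chosen only for illustration.
m, n, r = 4096, 4096, 128
full = adam_state_floats(m, n)
sub = subspace_adam_state_floats(m, r)
print(full, sub, full // sub)
```

Under these assumptions the moment buffers shrink by a factor of n / r.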
Abstract
The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters…
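The idea of splitting training into a series of lower-dimensional subproblems can be illustrated with a generic random-subspace descent loop. This is a minimal sketch under stated assumptions, not the authors' algorithm: the abstract above is truncated, so the inner solver (plain gradient descent), the Gaussian projection P, the least-squares objective, and all names and hyperparameters here are illustrative choices.

```python
import numpy as np


def loss_and_grad(W, X, Y):
    """Least-squares loss 0.5 * ||X W - Y||_F^2 / N and its gradient w.r.t. W."""
    R = X @ W - Y
    num_rows = X.shape[0]
    return 0.5 * np.sum(R * R) / num_rows, X.T @ R / num_rows


def random_subspace_descent(W, X, Y, rank=4, outer_steps=200,
                            inner_steps=5, lr=0.2, seed=0):
    """Minimize over W (m x n) by solving small subproblems in B (m x rank)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    for _ in range(outer_steps):
        # Assumed choice: Gaussian P (n x rank), scaled so E[P^T P] = I.
        P = rng.standard_normal((n, rank)) / np.sqrt(n)
        B = np.zeros((m, rank))  # low-dimensional variable for this subproblem
        for _ in range(inner_steps):
            _, G = loss_and_grad(W + B @ P.T, X, Y)
            # Chain rule: the gradient of f(W + B P^T) w.r.t. B is G P.
            B -= lr * (G @ P)
        # Fold the subspace solution back into the full parameters.
        W = W + B @ P.T
    return W
```

Only B, which has m x rank entries, is optimized inside each subproblem, so any per-parameter optimizer state would scale with the subspace size rather than with the full matrix.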
Taxonomy
Topics: Text and Document Classification Technologies
Methods: Adam
