MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling
Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan

TL;DR
MISA is a module-wise importance sampling method for large language model optimization: it divides each transformer layer into smaller modules, assigns each an importance score, and activates only a weighted random sample of them, improving memory efficiency and convergence.
Contribution
It proposes a novel module-wise importance sampling approach that reduces memory usage and gradient variance, improving optimization efficiency for large language models.
Findings
MISA achieves lower memory consumption compared to layer-wise methods.
The method provides a provable $O(1/\sqrt{K})$ convergence rate.
Experiments demonstrate MISA's effectiveness across various tasks.
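A rate of this kind is usually read as a bound on a stationarity measure after K updates. The following is only a sketch of the common non-convex stochastic form; the objective f, the iterates x_k, the constant C, and the choice of measure are assumptions made here for illustration, not the paper's theorem statement.

```latex
\[
\frac{1}{K}\sum_{k=0}^{K-1} \mathbb{E}\,\bigl\|\nabla f(x_k)\bigr\|^2 \;\le\; \frac{C}{\sqrt{K}}
\]
```

Under a bound of this shape, quadrupling K halves the right-hand side.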
Abstract
The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing…
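The weighted random sampling idea described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the module names, the score values, the softmax mapping from scores to probabilities, the `temperature` parameter, and the choice to draw `k` modules without replacement.

```python
import math
import random


def sampling_probabilities(scores, temperature=1.0):
    """Map importance scores to probabilities with a softmax.

    The softmax form and the temperature are assumptions for this sketch;
    the source only says that modules are sampled according to importance.
    """
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp((s - top) / temperature) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def sample_active_modules(scores, k, rng, temperature=1.0):
    """Draw up to k distinct modules, each draw weighted by its probability."""
    remaining = sampling_probabilities(scores, temperature)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        weights = [remaining[n] for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Hypothetical modules inside two transformer layers, with made-up scores.
scores = {
    "layer0.attn.q_proj": 0.9,
    "layer0.attn.v_proj": 0.4,
    "layer0.mlp.up_proj": 0.2,
    "layer1.attn.q_proj": 0.7,
    "layer1.mlp.down_proj": 0.1,
}

rng = random.Random(0)
probs = sampling_probabilities(scores)
active = sample_active_modules(scores, k=2, rng=rng)
frozen = [name for name in scores if name not in active]

print("active:", active)
print("frozen:", frozen)
```

In a training loop built on this sketch, only the parameters of the `active` modules would receive gradients and optimizer states for the current step, while the `frozen` ones would be skipped. That is how a module-level scheme can keep less than one full layer active at a time.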
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
