Sparse Gradient Compression for Fine-Tuning Large Language Models
David H. Yang, Mohammad Mohammadi Amiri, Tejaswini Pedapati, Subhajit Chaudhury, Pin-Yu Chen

TL;DR
This paper introduces sparse gradient compression (SGC), a method that exploits gradient sparsity to reduce optimizer memory during fine-tuning of large language models while maintaining or improving task performance.
Contribution
SGC is a new training regime that compresses optimizer states by projecting gradients onto low-dimensional subspaces, offering flexible memory-performance tradeoffs during LLM fine-tuning.
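The projection idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the top-k sparsifier, the Gaussian projection matrix, the class `CompressedAdamState`, and all sizes are assumptions chosen for illustration, and the step that maps a compressed update back to the full parameter space is left out. The point is only that the stored moment estimates have length m, which is chosen independently of the gradient length n.

```python
import math
import random


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad; zero out the rest."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def gaussian_projection(m, n, seed=0):
    """Hypothetical m x n random matrix with N(0, 1/m) entries (an assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def project(matrix, vec):
    """Matrix-vector product: maps an n-dimensional vector to m dimensions."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


class CompressedAdamState:
    """Adam-style moment estimates stored in the m-dimensional compressed space."""

    def __init__(self, m, beta1=0.9, beta2=0.999):
        self.beta1, self.beta2 = beta1, beta2
        self.first = [0.0] * m   # first-moment estimate, length m (not n)
        self.second = [0.0] * m  # second-moment estimate, length m (not n)

    def update(self, p):
        """Exponential moving averages of the compressed gradient p."""
        self.first = [self.beta1 * a + (1 - self.beta1) * x
                      for a, x in zip(self.first, p)]
        self.second = [self.beta2 * b + (1 - self.beta2) * x * x
                       for b, x in zip(self.second, p)]


grad = [0.1, -2.0, 0.05, 3.0, -0.2, 0.0, 1.5, -0.01]  # toy gradient, n = 8
sparse = top_k_sparsify(grad, k=3)                    # keep 3 largest entries
A = gaussian_projection(m=4, n=len(grad))             # compress 8 -> 4
compressed = project(A, sparse)
state = CompressedAdamState(m=4)
state.update(compressed)
```

In this sketch, changing `m` changes the optimizer-state size without touching the model's dimensions, which is the kind of memory-performance knob the contribution statement refers to.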
Findings
SGC reduces optimizer memory more effectively than existing PEFT methods.
SGC maintains or improves fine-tuning performance across various tasks.
SGC is especially beneficial in data-limited and memory-limited scenarios.
Abstract
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. However, the high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. To address this, parameter-efficient fine-tuning (PEFT) methods have been proposed to minimize the number of parameters required for fine-tuning LLMs. However, these approaches often tie the number of optimizer states to dimensions of model parameters, limiting flexibility and control during fine-tuning. In this paper, we propose sparse gradient compression (SGC), a training regime designed to address these limitations. Our approach leverages inherent sparsity in gradients to compress optimizer states by projecting them onto a low-dimensional subspace, with dimensionality independent of…
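The abstract's point about optimizer states being tied to model dimensions can be made concrete with a little arithmetic. The sketch below assumes Adam (two moment estimates per trainable parameter), a single m x n weight matrix, and LoRA factors of shape m x r and r x n; the function names are hypothetical, and `adam_states_compressed` simply assumes a freely chosen compressed dimension k rather than reproducing the paper's exact accounting.

```python
def adam_states_full(m, n):
    """Full fine-tuning: two Adam moments for each of the m * n weights."""
    return 2 * m * n


def adam_states_lora(m, n, r):
    """LoRA: trainable factors B (m x r) and A (r x n), two moments each."""
    return 2 * r * (m + n)


def adam_states_compressed(k):
    """Assumed accounting: two moments for a k-dimensional compressed state."""
    return 2 * k


m = n = 4096
full = adam_states_full(m, n)                 # 33,554,432 states
lora_r1 = adam_states_lora(m, n, r=1)         # 16,384 states
lora_r2 = adam_states_lora(m, n, r=2)         # 32,768 states
lora_step = lora_r2 - lora_r1                 # smallest LoRA increment
compressed_step = adam_states_compressed(2) - adam_states_compressed(1)
```

Under these assumptions, the smallest step between LoRA settings is 2(m + n) states per matrix, so it grows with the layer size, whereas a compressed dimension chosen directly can be adjusted in steps that do not depend on m or n.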
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
1. The paper introduces two algorithms, MESGC and CESGC, for effectively reducing the memory and computational complexity, respectively. 2. It presents a well-reasoned approach for determining the hyperparameters of the SGC algorithm in Section 5.2.
1. The novelty of the proposed approach is limited. The concept of projecting optimizer states into a subspace with a dimension independent of the original model size has been previously discussed in the top-k compressor as shown in [1] and [2]. 2. The paper lacks a theoretical analysis of the relationship between the choice of k and the model’s convergence, a detail that has been explored in [1] and [2]. 3. The idea behind SGC lacks novelty, as both algorithms are quite similar to GaLore. Addition
1. SGC is more flexible compared to previous methods like LoRA and GaLore, allowing more granular control over the dimensionality of the compressed optimizer state. 2. On commonsense benchmarks, SGC achieves a comparable average accuracy to both GaLore and LoRA while using fewer optimizer state parameters.
1. Although SGC is more flexible, this advantage is somewhat marginal, as LoRA and GaLore are already quite flexible. 2. The paper lacks throughput experiments and runtime analysis of OMP. 3. It does not include empirical experiments comparing memory usage of SGC and baseline methods to validate the theoretical analysis. 4. There is no information on the error magnitude after gradient compression. 5. Mischaracterization in lines 350-352 and 368-369, where it mentions GaLore as a type of PEFT met
1. The writing is clear and well-structured, with an appropriate balance of detail, making it easy to understand. Most technical choices are well-motivated and thoroughly explained. 2. The motivation behind the proposed SGC method is intuitive, making the approach conceptually accessible and logical given the challenges in fine-tuning large language models. 3. The experiments and comparative analyses effectively demonstrate that SGC offers memory savings while maintaining comparable performance,
1. **Limited Applicability**: While the paper claims that SGC offers a more flexible, fine-grained tradeoff, PEFT methods typically target compute-constrained scenarios, where such granular control may require extra tuning that reduces practicality. It would be beneficial to include a plot with sparsity on the x-axis and performance on the y-axis to directly compare the flexibility of SGC with LoRA. This visualization could more intuitively demonstrate whether SGC’s fine-grained control offers p
1. The paper presents a novel approach that addresses memory efficiency in large-scale fine-tuning tasks. The proposed approach enables more flexible and granular control over the number of parameters to train during finetuning. 2. Experimental evaluation shows that SGC competes well with and often outperforms existing methods (e.g., LoRA, GaLore) in terms of memory efficiency and accuracy.
1. The author highlights limitations in the flexibility and granularity of LoRA due to the dependency on model dimensions. However, in practical applications, these constraints may not significantly impact performance. Many real-world tasks do not require extreme reductions in trainable parameters, and the existing flexibility of LoRA is often sufficient. For instance, as shown in Table 2, LoRA fine-tunes only 0.2% of the parameters, meaning the LoRA weights and optimizer states are not the bott
- The presentation is clear and easy to follow, with only a few minor typos. - The proposed sparse gradient method is straightforward and supported by reasonable theoretical foundations.
- **Dataset Limitation**: The authors only use a single dataset (Commonsense) in the experimental sections. I strongly recommend adding at least one more dataset to demonstrate the generalizability of the algorithm across different data domains. - **Comparison to LoRA in Speed**: While SGC effectively reduces optimizer memory costs similar to LoRA, LoRA offers additional advantages by significantly speeding up the fine-tuning process. Through low-rank adapters and fewer trainable parameters, Lo
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Algorithms and Data Compression
