SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers
Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets

TL;DR
SparseGrad is a novel parameter-efficient fine-tuning method that selectively updates MLP layers in Transformer models by sparsifying gradients, leading to better performance with less memory usage.
Contribution
It introduces SparseGrad, a new gradient sparsification technique for MLP blocks, improving fine-tuning efficiency and effectiveness over existing PEFT methods.
Findings
Outperforms LoRA and MeProp on BERT, RoBERTa, and LLaMa-2 tasks.
Reduces parameter updates to about 1% of layer elements.
Achieves better results than LoRA and MeProp under identical memory requirements.
Abstract
The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp,…
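The sparsification step described in the abstract can be illustrated with a minimal sketch. It assumes the gradient has already been expressed in the space where few entries are significant; the transform itself is not specified in the text above and is omitted here. The function names sparsify_top_fraction and apply_sparse_update are hypothetical, and plain magnitude-based selection stands in for whatever selection rule the authors actually use.

```python
import heapq


def sparsify_top_fraction(grad, fraction=0.01):
    """Keep only the largest-magnitude entries of a 2-D gradient.

    Assumption: `grad` (a list of rows) is already in the space where
    few entries matter. Returns sorted (row, col, value) triples.
    """
    rows, cols = len(grad), len(grad[0])
    # Number of entries to keep, e.g. about 1% of the layer's elements.
    k = max(1, round(rows * cols * fraction))
    entries = (
        (abs(value), i, j)
        for i, row in enumerate(grad)
        for j, value in enumerate(row)
    )
    top = heapq.nlargest(k, entries)
    return sorted((i, j, grad[i][j]) for _, i, j in top)


def apply_sparse_update(weights, sparse_grad, lr):
    """Plain SGD step that touches only the selected entries, in place."""
    for i, j, g in sparse_grad:
        weights[i][j] -= lr * g


# Toy 10 x 20 "layer": 200 elements, so a 1% budget keeps 2 of them.
grad = [[0.001] * 20 for _ in range(10)]
grad[3][7] = 5.0
grad[8][1] = -4.0

weights = [[1.0] * 20 for _ in range(10)]
selected = sparsify_top_fraction(grad, fraction=0.01)
apply_sparse_update(weights, selected, lr=0.5)
print(selected)
```

Storing the selected entries as (row, col, value) triples means the update cost and the gradient storage scale with the number of kept entries rather than with the full layer size, which is the intuition behind the memory claim; a real implementation would use sparse tensor types rather than Python lists.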
Taxonomy
Topics: DNA and Biological Computing · Cellular Automata and Applications · Photonic and Optical Devices
Methods: Attention Is All You Need · Dense Connections · WordPiece · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Attention Dropout · Linear Layer · Weight Decay
