SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers

Viktoriia Chekalina; Anna Rudenko; Gleb Mezentsev; Alexander Mikhalev; Alexander Panchenko; Ivan Oseledets

arXiv:2410.07383·cs.CL·October 11, 2024

TL;DR

SparseGrad is a selective parameter-efficient fine-tuning method that updates the MLP layers of Transformer models through sparsified gradients, outperforming LoRA and MeProp under identical memory requirements.

Contribution

It introduces SparseGrad, a new gradient sparsification technique for MLP blocks, improving fine-tuning efficiency and effectiveness over existing PEFT methods.

Findings

01. Outperforms LoRA and MeProp on BERT, RoBERTa, and LLaMa-2 tasks.
02. Reduces parameter updates to about 1% of layer elements.
03. Achieves better results with identical memory constraints.
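
To give a sense of scale for the "about 1% of layer elements" finding, the arithmetic can be sketched for a single MLP weight matrix. The 768 x 3072 shape below is an assumption chosen for illustration (it matches the intermediate projection of a BERT-base-sized model); the source does not state layer dimensions, and the helper name `updated_elements` is hypothetical.

```python
def updated_elements(rows, cols, percent=1):
    """Return (total elements, elements updated) for one weight matrix.

    Integer arithmetic is used so the result does not depend on
    floating-point rounding.
    """
    total = rows * cols
    return total, total * percent // 100


# Assumed shape: 768 x 3072, as in a BERT-base-sized MLP projection.
total, kept = updated_elements(768, 3072)
print(f"total elements: {total}, updated at 1%: {kept}")
# prints "total elements: 2359296, updated at 1%: 23592"
```

Under this assumption, roughly 23.6 thousand of about 2.36 million elements per matrix would receive updates.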

Abstract

The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text. Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where only about 1% of the layer's elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the Question-Answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp,…
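
The idea of moving a gradient into a space where few entries matter, then keeping only those entries, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the transfer space is built, so the orthogonal bases `U` and `V` here are an assumption, taken from the SVD of a synthetic low-rank reference gradient. The function name `sparsify_gradient` and all sizes are hypothetical.

```python
import numpy as np


def sparsify_gradient(grad, U, V, keep_fraction=0.01):
    """Rotate grad into the (U, V) basis and keep the largest-magnitude entries.

    Returns the sparsified gradient in the rotated basis and the boolean mask
    of kept positions.
    """
    rotated = U.T @ grad @ V
    k = max(1, int(round(keep_fraction * rotated.size)))
    flat = np.abs(rotated).ravel()
    # Indices of the k largest absolute values (order among them is irrelevant).
    top = np.argpartition(flat, -k)[-k:]
    mask = np.zeros(flat.shape, dtype=bool)
    mask[top] = True
    mask = mask.reshape(rotated.shape)
    return np.where(mask, rotated, 0.0), mask


rng = np.random.default_rng(0)
d_out, d_in = 64, 256  # illustrative sizes, not taken from the paper

# Assumption: gradients are approximately low rank, so a basis fitted to a
# reference gradient concentrates most of the energy in a few entries.
reference = rng.standard_normal((d_out, 4)) @ rng.standard_normal((4, d_in))
U, _, Vt = np.linalg.svd(reference, full_matrices=True)
V = Vt.T

grad = reference + 0.01 * rng.standard_normal((d_out, d_in))
sparse_grad, mask = sparsify_gradient(grad, U, V)

kept = int(mask.sum())
energy_kept = float((sparse_grad ** 2).sum() / (grad ** 2).sum())
print(f"kept {kept} of {grad.size} entries, energy retained: {energy_kept:.4f}")
```

Because `U` and `V` are orthogonal, the rotation preserves the gradient's total squared norm, so the retained-energy ratio shows how much of the original gradient survives sparsification. In this sketch the update would be applied to the weights expressed in the same rotated basis; how the paper handles that step is not covered by the abstract.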

Code & Models

Repositories

sayankotor/sparse_grads (PyTorch, official)

Videos

SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers (Underline)

Taxonomy

Topics: DNA and Biological Computing · Cellular Automata and Applications · Photonic and Optical Devices

Methods: Attention Is All You Need · Dense Connections · WordPiece · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Attention Dropout · Linear Layer · Weight Decay