SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification
Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng

TL;DR
This paper introduces SPT, a system that makes fine-tuning Transformer-based language models more efficient by introducing sparsity, reducing peak memory consumption by up to 50% and speeding up fine-tuning by up to 2.2x.
Contribution
The paper presents novel sparse modules for attention and feed-forward networks, enabling efficient transformer fine-tuning with reduced resource consumption.
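As a rough illustration of what a sparse attention module can look like, the sketch below keeps only the k largest attention weights for a single query and renormalizes over them. This is a minimal sketch under stated assumptions, not the SPT implementation: the function name `topk_sparse_attention` and the top-k selection rule are hypothetical, and the sketch still scores every key before selecting, which a real system would want to avoid.

```python
import math

def topk_sparse_attention(q, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Hypothetical illustration: only k weights are stored per query
    instead of one weight per key.
    """
    d = len(q)
    # Scaled dot-product score for every key (dense here, for simplicity).
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
    # Indices of the k largest scores; all other weights are dropped.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only (subtract the max for stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    # Weighted sum of the selected value vectors.
    out = [sum(weights[i] * values[i][j] for i in top) for j in range(len(values[0]))]
    return out, weights
```

Calling `topk_sparse_attention(q, keys, values, k=2)` returns the output vector together with a dictionary holding just two weights, so storage per query grows with k rather than with the sequence length.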
Findings
Peak memory consumption reduced by up to 50%
Fine-tuning speed increased by up to 2.2x
Consistent performance improvements over baselines
Abstract
Transformer-based large language models (e.g., BERT and GPT) have achieved great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard way to apply these models to downstream tasks. However, Transformer fine-tuning suffers from long running times and high memory consumption because the models are so large. We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity. We observe that the memory consumption of a Transformer mainly comes from storing attention weights for multi-head attention (MHA), and that the majority of the running time is spent in the feed-forward network (FFN). Thus, we design the sparse MHA module, which computes and stores only large attention weights to reduce memory consumption, and the routed FFN module, which dynamically activates a subset of model parameters for each token to reduce…
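The routed FFN idea in the abstract, activating only a subset of parameters per token, can be sketched as follows. This is a hedged illustration rather than the paper's design: the block partitioning of the FFN, the linear `router` matrix, the ReLU activation, and the name `routed_ffn` are all assumptions made for the example.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def routed_ffn(x, router, blocks, n_active):
    """Apply only the n_active highest-scoring FFN blocks to token x.

    Hypothetical layout: each block is (W_in, W_out), where W_in maps
    d -> hidden_b and W_out maps hidden_b -> d. The router holds one
    scoring row per block.
    """
    # One routing score per block for this token.
    route_scores = matvec(router, x)
    active = sorted(range(len(blocks)), key=lambda b: route_scores[b],
                    reverse=True)[:n_active]
    out = [0.0] * len(x)
    for b in active:
        W_in, W_out = blocks[b]
        # Hidden activations of this block only (ReLU).
        h = [max(0.0, v) for v in matvec(W_in, x)]
        # Project back to the model dimension and accumulate.
        y = matvec(W_out, h)
        out = [o + yi for o, yi in zip(out, y)]
    return out, sorted(active)
```

With `n_active` equal to the number of blocks, the function reduces to the dense FFN in this layout; smaller values skip the matrix products for the unselected blocks, which is where the compute saving would come from.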
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam
