SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification

Yuntao Gui; Xiao Yan; Peiqi Yin; Han Yang; James Cheng

arXiv:2312.10365 · cs.DC · December 19, 2023 · 1 citation

TL;DR

This paper introduces SPT, a system that makes fine-tuning large Transformer models more efficient by introducing sparsity, reducing peak memory consumption by up to 50% and speeding up fine-tuning by up to 2.2x.

Contribution

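The paper presents two sparse modules for efficient Transformer fine-tuning: a sparse multi-head attention (MHA) module that computes and stores only large attention weights to reduce memory consumption, and a routed feed-forward network (FFN) module that dynamically activates a subset of model parameters for each token.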
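
To illustrate the routed FFN idea from the abstract (each token activates only a subset of the FFN parameters), here is a minimal sketch in plain Python. It is not the SPT implementation: the function names, the linear router, the top-k selection rule, and the partition of the hidden layer into blocks are all assumptions made for this example.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def routed_ffn(x, router, blocks, top_k):
    """Apply an FFN to one token using only the top_k highest-scoring blocks.

    x      : token vector of length d
    router : one weight row per block (hypothetical linear router)
    blocks : list of (W_in, W_out) pairs; W_in is h x d, W_out is d x h
    """
    # Score every parameter block for this token.
    scores = matvec(router, x)
    # Activate only the top_k blocks; the rest are never computed.
    active = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:top_k]
    out = [0.0] * len(x)
    for b in active:
        W_in, W_out = blocks[b]
        hidden = [max(0.0, h) for h in matvec(W_in, x)]  # ReLU
        y = matvec(W_out, hidden)
        out = [o + yi for o, yi in zip(out, y)]
    return active, out


# Toy example: three blocks with one hidden unit each, token dimension 2.
x = [1.0, 2.0]
router = [[1, 0], [0, 1], [-1, -1]]
blocks = [
    ([[1, 0]], [[1], [0]]),
    ([[0, 1]], [[0], [1]]),
    ([[1, 1]], [[1], [1]]),
]
active, out = routed_ffn(x, router, blocks, top_k=2)
```

Because ReLU is applied elementwise, activating every block (top_k equal to the number of blocks) gives the same result as a dense FFN whose hidden layer is the concatenation of the blocks; choosing a smaller top_k skips the matrix products for the inactive blocks.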

Findings

01. Peak memory consumption reduced by up to 50%

02. Fine-tuning speed increased by up to 2.2x

03. Consistent performance improvements over baselines

Abstract

Transformer-based large language models (e.g., BERT and GPT) have achieved great success, and fine-tuning, which tunes a pre-trained model on a task-specific dataset, is the standard practice for applying these models to downstream tasks. However, Transformer fine-tuning suffers from long running times and high memory consumption due to the large size of the models. We propose the SPT system to fine-tune Transformer-based models efficiently by introducing sparsity. We observe that the memory consumption of a Transformer mainly comes from storing attention weights for multi-head attention (MHA), and the majority of the running time is spent on the feed-forward network (FFN). Thus, we design the sparse MHA module, which computes and stores only large attention weights to reduce memory consumption, and the routed FFN module, which dynamically activates a subset of model parameters for each token to reduce…
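
The sparse MHA idea described above (keep only the large attention weights) can be sketched for a single query in plain Python. This is an illustrative assumption, not the SPT implementation: the function name and the top-k selection rule are hypothetical, and the sketch computes every score before selecting, whereas the text shown here does not say how SPT identifies the large weights.

```python
import math


def topk_sparse_attention(q, keys, values, k):
    """Attend from one query to only its k highest-scoring keys.

    Returns the kept weights (index -> weight) and the output vector.
    """
    d = len(q)
    # Scaled dot-product score against every key (done densely here for clarity).
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
    # Indices of the k largest scores; all other weights are treated as zero.
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only (subtract the max for numerical stability).
    m = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - m) for i in kept}
    z = sum(exps.values())
    weights = {i: e / z for i, e in exps.items()}
    # Weighted sum of the values that belong to the kept keys.
    out = [sum(weights[i] * values[i][j] for i in kept) for j in range(len(values[0]))]
    return weights, out


# Toy example: four keys, keep the two largest attention weights.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [3.0, 3.0]]
weights, out = topk_sparse_attention(q, keys, values, k=2)
```

Only k weights per query are stored instead of one per key, which is where the memory saving in this sketch comes from; with k equal to the number of keys it reduces to ordinary softmax attention.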

Code & Models

Repositories

ytgui/spt-proto (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam