Sparse is Enough in Fine-tuning Pre-trained Large Language Models

Weixi Song; Zuchao Li; Lefei Zhang; Hai Zhao; Bo Du

arXiv:2312.11875 · cs.LG · June 11, 2024 · 1 citation

TL;DR

This paper introduces SIFT (Sparse Increment Fine-Tuning), a gradient-based sparse fine-tuning method for large language models. The method is motivated by a PAC-Bayesian generalization bound and is validated on the GLUE Benchmark and on instruction tuning.

Contribution

The paper proposes SIFT, a sparse fine-tuning algorithm for parameter-efficient adaptation. It is motivated by a PAC-Bayesian generalization bound and supported by analysis of the loss landscape and of quasi-sparsity in the gradient distribution.

Findings

01 SIFT achieves competitive performance on the GLUE Benchmark and on instruction-tuning tasks.

02 The theoretical analysis treats pre-training as a shift of the prior distribution, which yields a tighter PAC-Bayesian generalization bound.

03 Experiments on the GLUE Benchmark and on instruction tuning support the effectiveness of sparse fine-tuning.
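
For orientation, the PAC-Bayesian view mentioned in the findings can be written out with one commonly cited form of the bound (the McAllester bound as tightened by Maurer). This is a sketch under stated assumptions, not necessarily the statement used in the paper: the symbols P (prior), Q (posterior), n (number of training samples), and δ (failure probability) are notation chosen here, and the paper's exact bound may differ.

```latex
% A standard PAC-Bayesian bound (McAllester/Maurer form), shown for illustration.
% Assumptions: loss bounded in [0,1], n >= 8 i.i.d. samples, and a prior P
% fixed before seeing the fine-tuning data.
% With probability at least 1 - \delta over the sample, for every posterior Q:
\[
  L(Q) \;\le\; \hat{L}(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
% L(Q): expected loss under Q;  \hat{L}(Q): empirical loss under Q.
```

Under this form, a prior P that already sits close to the fine-tuned posterior Q makes the KL term small, which is one way to read the claim that pre-training, viewed as a shift of the prior, tightens the bound.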

Abstract

With the prevalence of the pre-training-fine-tuning paradigm, how to efficiently adapt a pre-trained model to downstream tasks has become an intriguing question. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation. Although PEFT has demonstrated its effectiveness and has been widely applied, its underlying principles are still unclear. In this paper, we adopt the PAC-Bayesian generalization error bound and view pre-training as a shift of the prior distribution, which leads to a tighter bound on the generalization error. We validate this shift from two perspectives: oscillations in the loss landscape and quasi-sparsity in the gradient distribution. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks including the GLUE Benchmark and instruction tuning.…
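
To make the idea of gradient-based sparse fine-tuning concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the song-wx/sift repository for that). The selection rule (keep the entries with the largest gradient magnitude), the `density` parameter, the plain SGD update, and the function names `top_k_indices` and `sparse_sgd_step` are all assumptions made for illustration; the abstract only states that the method is gradient-based and motivated by quasi-sparsity in the gradient distribution.

```python
import heapq


def top_k_indices(grads, k):
    """Return, in ascending order, the indices of the k largest-magnitude gradients."""
    largest = heapq.nlargest(k, range(len(grads)), key=lambda i: abs(grads[i]))
    return sorted(largest)


def sparse_sgd_step(params, grads, density, lr):
    """One illustrative sparse update (hypothetical helper, not the paper's algorithm).

    Only a `density` fraction of the parameters, chosen by gradient magnitude,
    receives an SGD update; every other parameter keeps its pre-trained value.
    """
    k = max(1, int(round(density * len(params))))
    selected = top_k_indices(grads, k)
    updated = list(params)  # copy so the pre-trained values are left untouched
    for i in selected:
        updated[i] -= lr * grads[i]
    return updated, selected


# Toy example: a flattened parameter vector and a quasi-sparse gradient,
# where two entries dominate and the rest are close to zero.
params = [0.5, -1.0, 2.0, 0.0, 1.5]
grads = [0.01, -0.90, 0.02, 0.70, -0.03]

new_params, selected = sparse_sgd_step(params, grads, density=0.4, lr=0.1)
print("updated indices:", selected)
print("new parameters:", new_params)
```

With a density of 0.4 on five parameters, only the two entries with the largest gradient magnitude are changed. In a real model the same selection would be applied to tensors rather than Python lists, and the choice of when to compute the selection mask is a further design decision this sketch does not address.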

Code & Models

Repositories

song-wx/sift (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning