S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Xinyu Yang; Jixuan Leng; Geyang Guo; Jiawei Zhao; Ryumei Nakada; Linjun Zhang; Huaxiu Yao; Beidi Chen

arXiv:2412.06289·cs.LG·December 20, 2024

TL;DR

S$^{2}$FT introduces a structured sparse fine-tuning approach for large language models that enhances performance, efficiency, and scalability by selectively updating submatrices, achieving state-of-the-art results across multiple tasks.

Contribution

The paper proposes a novel structured sparse fine-tuning method that improves generalization, efficiency, and scalability of LLMs, outperforming existing PEFT methods and full fine-tuning.

Findings

01. Achieves state-of-the-art performance on reasoning tasks.

02. Reduces training memory by up to 3 times.

03. Improves inference latency by 1.5-2.7 times.

Abstract

Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". It selects a few heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all…
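
The selection and co-permutation steps described in the abstract can be illustrated on a toy FFN. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the two-matrix ReLU FFN, the names `ffn`, `co_permute`, `w_up`, and `w_down`, the random channel selection, and the placeholder update are all hypothetical choices made for illustration. It shows that applying the same permutation to the rows of the first matrix and the columns of the second leaves the FFN output unchanged, while gathering the selected channels into one contiguous block that can be updated in place.

```python
import numpy as np


def ffn(x, w_up, w_down):
    """Toy two-matrix FFN with an elementwise ReLU: w_down @ relu(w_up @ x)."""
    return w_down @ np.maximum(w_up @ x, 0.0)


def co_permute(w_up, w_down, selected):
    """Move the selected hidden channels to the front of both matrices.

    The same permutation is applied to the rows of w_up and the columns
    of w_down, so the composed function is unchanged.
    """
    hidden = w_up.shape[0]
    rest = np.setdiff1d(np.arange(hidden), selected)
    perm = np.concatenate([selected, rest])
    return w_up[perm, :], w_down[:, perm], perm


rng = np.random.default_rng(0)
d_model, d_hidden, k = 4, 8, 2  # hypothetical toy sizes
w_up = rng.standard_normal((d_hidden, d_model))
w_down = rng.standard_normal((d_model, d_hidden))
x = rng.standard_normal(d_model)

# "Select sparsely": pick k hidden channels (random here, an assumption).
selected = np.sort(rng.choice(d_hidden, size=k, replace=False))
w_up_p, w_down_p, perm = co_permute(w_up, w_down, selected)

# Co-permutation preserves the FFN output.
same_output = np.allclose(ffn(x, w_up, w_down), ffn(x, w_up_p, w_down_p))

# "Compute densely": the selected channels are now a contiguous slice.
before = w_down_p.copy()
block = w_down_p[:, :k]  # basic slice, so this is a view into w_down_p
block -= 0.1 * rng.standard_normal(block.shape)  # placeholder for a gradient step
changed_cols = np.flatnonzero(np.any(w_down_p != before, axis=0))
```

Because the activation is elementwise, it commutes with the channel permutation, which is why `same_output` holds. After the permutation, the trainable part is an ordinary dense slice, so an in-place update touches only the first `k` columns and needs no sparse indexing.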

Taxonomy

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing