FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing

James Seale Smith; Chi-Heng Lin; Shikhar Tuli; Haris Jeelani; Shangqian Gao; Yilin Shen; Hongxia Jin; Yen-Chang Hsu

arXiv:2501.14713 · cs.CL · February 3, 2025

Open Access

TL;DR

FlexiGPT introduces a novel pruning and extension technique for large language models that uses low-rank weight sharing to maintain performance while reducing model size, enabling efficient deployment on constrained devices.

Contribution

The paper presents a new method combining importance-based pruning with low-rank weight sharing to improve LLM compression and extension capabilities.

Findings

1. Achieves state-of-the-art performance on 5/6 benchmarks at 30% compression.

2. Maintains high performance on all 6 benchmarks at 40% compression.

3. Extends smaller models effectively with minimal additional training and parameters.

Abstract

The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We present a method to prune LLMs that selectively prunes model blocks based on an importance score and replaces them with a low-parameter replacement strategy. Specifically, we propose a principled metric to replace each pruned block using a weight-sharing mechanism that leverages unpruned counterparts from the model and block-specific low-rank adapters. Furthermore, we facilitate the learning of these replacement blocks with output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Empirical evaluations demonstrate substantial performance gains over existing methods, achieving state-of-the-art performance on…
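
To make the mechanism in the abstract concrete, here is a minimal NumPy sketch under stated assumptions. The abstract does not give the exact importance score or adapter formulation, so both functions are illustrative stand-ins, not the authors' implementation: `block_importance` uses a cosine-similarity proxy (a block that barely changes its input scores low), and `low_rank_adapter_init` takes a truncated SVD of the residual between a pruned block's weight and the shared weight that replaces it. All names, shapes, and the toy data are hypothetical.

```python
import numpy as np


def block_importance(x_in, x_out):
    """Assumed proxy: 1 - mean cosine similarity between block input and output.

    x_in, x_out: arrays of shape (tokens, hidden). A score near 0 means the
    block leaves its input almost unchanged, making it a pruning candidate.
    """
    num = np.sum(x_in * x_out, axis=-1)
    den = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1) + 1e-12
    return float(1.0 - np.mean(num / den))


def low_rank_adapter_init(w_pruned, w_shared, rank):
    """Initialize adapter factors (b, a) so that w_shared + b @ a ~ w_pruned.

    Uses the rank-`rank` truncated SVD of the residual, which is the best
    rank-`rank` approximation in Frobenius norm (Eckart-Young).
    """
    u, s, vt = np.linalg.svd(w_pruned - w_shared, full_matrices=False)
    b = u[:, :rank] * s[:rank]  # shape (out_dim, rank), singular values folded in
    a = vt[:rank, :]            # shape (rank, in_dim)
    return b, a


# Toy example: the pruned weight differs from the shared one by a rank-2 term,
# so a rank-2 adapter recovers it up to floating-point error.
rng = np.random.default_rng(0)
w_shared = rng.normal(size=(8, 6))
delta = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
w_pruned = w_shared + delta

b, a = low_rank_adapter_init(w_pruned, w_shared, rank=2)
recon_error = np.linalg.norm(w_shared + b @ a - w_pruned)

x = rng.normal(size=(4, 6))
score_identity = block_importance(x, x)   # block that does nothing
score_flipped = block_importance(x, -x)   # block that negates its input
```

In this sketch the replacement block stores only `b` and `a` (here 8×2 and 2×6) on top of the shared weight, which is where the parameter savings would come from; a real model would typically have a much larger gap between the adapter rank and the hidden size.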

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Adapter