NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models

Shengrui Li; Junzhe Chen; Xueting Han; Jing Bai

arXiv:2402.09773 · cs.CL · June 28, 2024 · 1 citation

TL;DR

NutePrune is a resource-efficient progressive pruning method for large language models that uses multiple teachers and LoRA modules to achieve high performance at significant sparsity levels.

Contribution

NutePrune introduces an efficient multi-teacher pruning approach that reduces memory costs and improves pruning effectiveness for large language models.

Findings

01. Retains 97.17% of original performance at 20% sparsity

02. Retains 95.07% of original performance at 25% sparsity

03. Demonstrates effectiveness across various tasks

Abstract

The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, reducing storage costs and improving inference speed. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge distillation is well suited to pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly…
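The memory-saving idea in the abstract (one intact model, combined with masks and LoRA modules) can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation: the class `SharedLinear`, its `student` flag, the per-output-unit mask, and the placement of the LoRA update are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import random


class SharedLinear:
    """Hypothetical layer: one frozen weight matrix used by teacher and student."""

    def __init__(self, d_in, d_out, rank=2, seed=0):
        rng = random.Random(seed)
        # Frozen weights of the intact model, stored only once.
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        # Structured mask, one entry per output unit (1.0 = keep, 0.0 = pruned).
        self.mask = [1.0] * d_out
        # LoRA factors: A is (rank x d_in), B is (d_out x rank).
        # B starts at zero, so the low-rank update is initially a no-op.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x, student=False):
        base = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        if not student:
            # Teacher mode: intact weights, no mask, no LoRA.
            return base
        # Student mode: add the low-rank update B(Ax), then apply the mask.
        ax = [sum(a * xi for a, xi in zip(row, x)) for row in self.A]
        delta = [sum(b * h for b, h in zip(row, ax)) for row in self.B]
        return [m * (o + d) for m, o, d in zip(self.mask, base, delta)]


layer = SharedLinear(d_in=4, d_out=3)
x = [1.0, 2.0, 3.0, 4.0]

teacher_out = layer.forward(x)                # intact model as teacher
layer.mask[1] = 0.0                           # prune output unit 1
student_out = layer.forward(x, student=True)  # same weights, masked student
```

Both modes read the same `W`, so no second copy of the model is held in memory; only the small mask and LoRA factors differ between teacher and student. In this sketch the pruned unit outputs zero while the kept units match the teacher until the LoRA factors are trained.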

Code & Models

Repositories

lucius-lsr/nuteprune · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation