NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
Shengrui Li, Junzhe Chen, Xueting Han, Jing Bai

TL;DR
NutePrune is a resource-efficient progressive pruning method for large language models that uses multiple teachers and LoRA modules to achieve high performance at significant sparsity levels.
Contribution
It introduces an efficient multi-teacher pruning approach that reduces memory costs and improves pruning effectiveness for large language models.
Findings
Retains 97.17% of original performance at 20% sparsity
Retains 95.07% of original performance at 25% sparsity
Demonstrates effectiveness across various tasks
Abstract
The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structured pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly…
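The memory-saving idea in the abstract, one intact model combined with masks and LoRA modules, can be illustrated with a toy layer. This is a minimal sketch and not the authors' implementation: the class `SharedLinear`, its methods, and the toy numbers are hypothetical, and it assumes that the teacher is the frozen weights used as-is while the student is the same weights with a structured mask and a LoRA update applied.

```python
# Illustrative sketch only: SharedLinear, its methods, and the toy numbers
# are hypothetical and are not taken from the NutePrune code base.

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class SharedLinear:
    """One frozen weight matrix that serves as both teacher and student."""

    def __init__(self, weight, rank):
        self.weight = weight  # frozen weights, stored in memory only once
        out_dim, in_dim = len(weight), len(weight[0])
        # Structured mask: one entry per output unit (1.0 keeps, 0.0 prunes).
        self.mask = [1.0] * out_dim
        # LoRA factors: A is rank x in_dim, B is out_dim x rank (zero init).
        self.lora_a = [[0.01] * in_dim for _ in range(rank)]
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x, use_mask, use_lora):
        y = matvec(self.weight, x)
        if use_lora:
            # Low-rank update B(Ax) added on top of the frozen output.
            delta = matvec(self.lora_b, matvec(self.lora_a, x))
            y = [a + b for a, b in zip(y, delta)]
        if use_mask:
            # Zero out the pruned output units.
            y = [m * v for m, v in zip(self.mask, y)]
        return y

    def teacher(self, x):
        """Intact model: no mask, no LoRA."""
        return self.forward(x, use_mask=False, use_lora=False)

    def student(self, x):
        """Pruned model: same weights, with mask and LoRA switched on."""
        return self.forward(x, use_mask=True, use_lora=True)


layer = SharedLinear([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], rank=1)
layer.mask[1] = 0.0  # prune the second output unit
x = [1.0, 1.0]
teacher_out = layer.teacher(x)
student_out = layer.student(x)
# Mean squared error between teacher and student outputs (a distillation signal).
distill_loss = sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / len(x + [0.0])
```

Because both passes read the same `weight` list, the extra memory for the student is limited to the mask and the two small LoRA factors. In a progressive setting, intermediate teachers could in principle be obtained the same way by using a less sparse mask, though that detail is an assumption here.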
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Knowledge Distillation
