MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi

TL;DR
MINI-LLM is a memory-efficient structured pruning method for large language models. It combines magnitude, activation, and estimated gradient information to prune channels and attention heads without the memory cost of backpropagation.
Contribution
The paper presents a hybrid pruning criterion and a forward-pass-only gradient estimation technique. Together they enable memory-efficient pruning of LLMs that outperforms existing gradient-free methods.
Findings
MINI-LLM outperforms gradient-free pruning methods on LLaMA, BLOOM, and OPT.
Its GPU memory usage stays comparable to that of gradient-free methods.
Pruned models retain high accuracy across various downstream tasks.
Abstract
As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a…
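The idea of estimating gradients from forward passes and folding them into a channel score can be illustrated with a small sketch. This is not the paper's algorithm: the abstract is truncated and does not give the estimator or the scoring formula. The code below assumes a two-point random-perturbation (SPSA-style) estimator and a hypothetical product-form score; the names `loss`, `estimate_grad`, and `channel_scores`, the toy linear layer, and the calibration data are all invented for illustration.

```python
import random

def loss(W, xs, ts):
    """Mean squared error of the toy linear map y = W x over a calibration set."""
    total = 0.0
    for x, t in zip(xs, ts):
        for row, t_i in zip(W, t):
            y_i = sum(w * x_j for w, x_j in zip(row, x))
            total += (y_i - t_i) ** 2
    return total / len(xs)

def estimate_grad(W, xs, ts, eps=1e-3, n_samples=64, seed=0):
    """Assumed SPSA-style estimator: only forward passes, no backpropagation.

    Each sample perturbs all weights along a random +/-1 direction z and uses
    (L(W + eps*z) - L(W - eps*z)) / (2*eps) as the slope along z.
    """
    rng = random.Random(seed)
    rows, cols = len(W), len(W[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for _ in range(n_samples):
        z = [[rng.choice((-1.0, 1.0)) for _ in range(cols)] for _ in range(rows)]
        W_plus = [[w + eps * d for w, d in zip(rw, rz)] for rw, rz in zip(W, z)]
        W_minus = [[w - eps * d for w, d in zip(rw, rz)] for rw, rz in zip(W, z)]
        slope = (loss(W_plus, xs, ts) - loss(W_minus, xs, ts)) / (2 * eps)
        for i in range(rows):
            for j in range(cols):
                grad[i][j] += slope * z[i][j] / n_samples
    return grad

def channel_scores(W, grad, xs):
    """Hypothetical hybrid score per input channel j:
    (sum over i of |W[i][j] * grad[i][j]|) * mean |activation of channel j|.
    """
    rows, cols = len(W), len(W[0])
    act = [sum(abs(x[j]) for x in xs) / len(xs) for j in range(cols)]
    return [
        sum(abs(W[i][j] * grad[i][j]) for i in range(rows)) * act[j]
        for j in range(cols)
    ]

# Toy data (invented): 3 input channels, 2 outputs; channel 2 is never active.
W = [[0.5, -1.0, 0.2], [1.5, 0.3, -0.7]]
xs = [[1.0, 2.0, 0.0], [0.5, -1.0, 0.0], [-2.0, 0.3, 0.0]]
ts = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

grad = estimate_grad(W, xs, ts)
scores = channel_scores(W, grad, xs)
print(scores)  # lower score = stronger pruning candidate under this toy criterion
```

In this toy setup the third input channel has zero activation on every calibration sample, so its score is exactly zero and it would be the first candidate for removal. The memory argument is that the estimator only ever evaluates the loss, so no activations need to be stored for a backward pass; the cost is extra forward passes and a noisier gradient.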
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: BLOOM · LLaMA · OPT · Pruning
