MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

Hongrong Cheng; Miao Zhang; Javen Qinfeng Shi

arXiv:2407.11681 · cs.CL · July 17, 2024

Open Access

TL;DR

MINI-LLM introduces a memory-efficient structured pruning method for large language models that combines magnitude, activation, and estimated gradient information to effectively prune channels and attention heads without high memory costs.

Contribution

The paper presents a hybrid pruning criterion and a gradient estimation technique that enables memory-efficient pruning of LLMs, outperforming existing gradient-free methods.

Findings

01 MINI-LLM outperforms existing gradient-free pruning methods on LLaMA, BLOOM, and OPT.

02 Its GPU memory usage stays comparable to that of gradient-free methods.

03 Pruned models retain high accuracy across various downstream tasks.

Abstract

As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compressing, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a…
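The two ideas named in the abstract, a gradient estimated from forward passes only and a score that mixes magnitude, activation, and gradient, can be illustrated with a small NumPy sketch. The abstract as shown here does not give the paper's formulas, so everything below is an assumption made for illustration: the two-point random-perturbation estimator, the particular product used as the score, the toy linear layer, and the function names `estimate_grad_forward_only` and `hybrid_channel_scores` are hypothetical and should not be read as the authors' method.

```python
import numpy as np


def estimate_grad_forward_only(loss_fn, w, n_samples=64, eps=1e-3, seed=0):
    """Two-point zeroth-order gradient estimate (assumed estimator).

    Only evaluates loss_fn, i.e. forward passes; no backpropagation.
    """
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(w)
    for _ in range(n_samples):
        u = rng.standard_normal(w.shape)  # random probe direction
        delta = loss_fn(w + eps * u) - loss_fn(w - eps * u)
        grad += (delta / (2.0 * eps)) * u  # directional derivative times direction
    return grad / n_samples


def hybrid_channel_scores(w, x, grad):
    """Hypothetical hybrid score per input channel.

    Combines weight magnitude, estimated gradient, and activation norm.
    w: (out_features, in_features), x: (n_samples, in_features).
    """
    act_norm = np.linalg.norm(x, axis=0)      # one norm per input channel
    per_weight = np.abs(w * grad) * act_norm  # broadcasts over output rows
    return per_weight.sum(axis=0)             # aggregate to one score per channel


# Toy linear layer with a mean-squared-error loss (illustrative data only).
rng = np.random.default_rng(1)
x = rng.standard_normal((32, 8))
w = rng.standard_normal((4, 8))
y = rng.standard_normal((32, 4))


def loss_fn(weights):
    return float(np.mean((x @ weights.T - y) ** 2))


grad = estimate_grad_forward_only(loss_fn, w, n_samples=256)
scores = hybrid_channel_scores(w, x, grad)
keep = np.sort(np.argsort(scores)[-6:])  # indices of the 6 highest-scoring channels
w_pruned = w[:, keep]                    # structured pruning: drop whole input channels
```

Each probe costs two forward passes and stores no activations for a backward pass, which is where the memory saving of a forward-only estimate comes from. The trade-off is a noisy estimate whose quality depends on the number of probes.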

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: BLOOM · LLaMA · OPT · Pruning