Compressing Large Language Models with Automated Sub-Network Search

Rhea Sanjay Sukthanker; Benedikt Staffler; Frank Hutter; Aaron Klein

arXiv:2410.06479·cs.CL·February 6, 2025

TL;DR

This paper frames LLM pruning as a neural architecture search problem: it automatically searches for sub-networks that reduce model size and on-device latency while improving downstream task performance, targeting the high inference costs that come with scaling LLMs.

Contribution

An automated sub-network search that prunes structural components of an LLM (attention heads, neurons, and layers) and outperforms existing structural pruning methods in both task performance and efficiency.

Findings

01 Up to 9.85% performance improvement on downstream tasks

02 Up to 22% reduction in on-device latency

03 Outperforms state-of-the-art pruning methods

Abstract

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. In this paper, we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks that balance performance against on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves up to 9.85% improvement on average on 11 diverse downstream…
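To make the notion of a Pareto-optimal set concrete, the following minimal sketch filters a list of candidate sub-networks down to those that no other candidate beats on both error and latency. It is an illustration only, not the paper's search procedure: the candidate names, error values, and latencies are hypothetical, and `pareto_front` is a helper introduced here.

```python
from typing import List, Tuple

# Hypothetical candidate record: (name, task_error, latency_ms).
# Lower is better for both objectives.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that are not dominated by any other candidate.

    A candidate is dominated if another one is no worse on both objectives
    and strictly better on at least one of them.
    """
    front = []
    for name, err, lat in candidates:
        dominated = any(
            (e2 <= err and l2 <= lat) and (e2 < err or l2 < lat)
            for _, e2, l2 in candidates
        )
        if not dominated:
            front.append((name, err, lat))
    # Sort by latency so the trade-off curve reads from fastest to slowest.
    return sorted(front, key=lambda c: c[2])


# Made-up numbers for illustration; they are not results from the paper.
candidates = [
    ("full", 0.30, 100.0),
    ("a", 0.32, 80.0),
    ("b", 0.35, 90.0),   # dominated by "a" (worse error and slower)
    ("c", 0.40, 60.0),
    ("d", 0.31, 100.0),  # dominated by "full" (same latency, worse error)
]

front = pareto_front(candidates)
for name, err, lat in front:
    print(f"{name}: error={err:.2f}, latency={lat:.0f} ms")
```

Each sub-network that survives the filter represents a distinct trade-off: choosing among them depends on the latency budget of the target device rather than on a single best score.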

Taxonomy

Topics: Algorithms and Data Compression

Methods: Softmax · Attention Is All You Need · Pruning