Compressing Large Language Models with Automated Sub-Network Search
Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein

TL;DR
This paper introduces an automated neural architecture search method to prune large language models, reducing size and latency while improving task performance, addressing the high inference costs of scaling LLMs.
Contribution
It presents a novel automated sub-network search approach that optimizes LLM pruning for better performance and efficiency, outperforming existing structural pruning methods.
Findings
Up to 9.85% average improvement on downstream tasks
Up to 22% reduction in on-device latency
Outperforms state-of-the-art pruning methods
Abstract
Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become increasingly prohibitive, accumulating significantly over their life cycle. In this paper, we consider model compression for LLMs to reduce model size while improving downstream task performance. We phrase this as a neural architecture search problem that automatically prunes structural components, such as attention heads, neurons, and layers, by searching for the Pareto-optimal set of sub-networks that balance performance and on-device latency. Compared to state-of-the-art structural pruning approaches and fine-tuned smaller sub-networks extracted from the pre-trained model, our method achieves up to 9.85% improvement on average on 11 diverse downstream…
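To make the "Pareto-optimal set of sub-networks" concrete, here is a minimal sketch of how such a set can be filtered from a pool of already-evaluated candidates. It is not the authors' search procedure: the candidate names, error values, and latency numbers are hypothetical, and the brute-force dominance check stands in for whatever multi-objective search the paper actually uses. A sub-network is kept only if no other candidate is at least as good on both objectives and strictly better on one.

```python
from typing import List, Tuple

# Hypothetical record: (name, task_error, latency_ms). Lower is better for both.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    front = []
    for name, err, lat in candidates:
        # Dominated: some other point is no worse on both objectives
        # and strictly better on at least one of them.
        dominated = any(
            (e <= err and l <= lat) and (e < err or l < lat)
            for _, e, l in candidates
        )
        if not dominated:
            front.append((name, err, lat))
    # Sort by latency so the trade-off curve reads from fastest to slowest.
    return sorted(front, key=lambda c: c[2])


# Illustrative, made-up measurements for five sub-networks.
candidates = [
    ("full_model", 0.30, 100.0),
    ("subnet_a", 0.32, 80.0),
    ("subnet_b", 0.35, 85.0),   # slower and less accurate than subnet_a
    ("subnet_c", 0.40, 60.0),
    ("subnet_d", 0.31, 100.0),  # same latency as full_model, higher error
]

front = pareto_front(candidates)
front_names = [name for name, _, _ in front]
print(front_names)
```

Each sub-network left on the front is a different trade-off between accuracy and latency, so a deployment can pick the point that fits its latency budget.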
Taxonomy
Topics: Algorithms and Data Compression
Methods: Softmax · Attention Is All You Need · Pruning
