Olica: Efficient Structured Pruning of Large Language Models without Retraining
Jiujun He, Huazhen Lin

TL;DR
Olica is a structured pruning framework for large language models that avoids retraining by combining PCA with linear calibration, cutting computational cost while preserving model accuracy.
Contribution
It introduces a retraining-free structured pruning method for LLMs using PCA and SVD, improving efficiency and preserving model performance.
Findings
Reduces the complexity of PCA by a factor of the square of the number of attention heads
Maintains accuracy without retraining across multiple benchmarks
Uses linear calibration to mitigate error accumulation in pruned models
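As an illustration of the linear calibration idea, the following is a minimal sketch and not the authors' implementation. It assumes calibration means fitting a ridge-regularized linear map from a layer's input to the residual between the original and pruned layer outputs, then adding that map's output back. The toy `ffn` layer, the helper `fit_linear_calibration`, the `ridge` parameter, and all sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def fit_linear_calibration(x, y_full, y_pruned, ridge=1e-3):
    """Ridge-regress the pruning residual (y_full - y_pruned) onto the layer input x.

    Assumption for illustration only: the paper's exact calibration objective
    is not given in the text above.
    """
    residual = y_full - y_pruned
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ residual)  # shape (d_in, d_out)

def ffn(x, w_in, w_out):
    """Toy feed-forward layer: ReLU(x @ w_in) @ w_out."""
    return np.maximum(x @ w_in, 0.0) @ w_out

rng = np.random.default_rng(0)
n, d_model, d_hidden, kept = 512, 32, 128, 64
x = rng.standard_normal((n, d_model))
w_in = rng.standard_normal((d_model, d_hidden))
w_out = rng.standard_normal((d_hidden, d_model))

y_full = ffn(x, w_in, w_out)
# Structured pruning stand-in: keep only the first `kept` hidden units.
y_pruned = ffn(x, w_in[:, :kept], w_out[:kept])

# Fit the linear correction on calibration data and add it to the pruned output.
calib = fit_linear_calibration(x, y_full, y_pruned)
y_calibrated = y_pruned + x @ calib

err_before = np.linalg.norm(y_full - y_pruned)
err_after = np.linalg.norm(y_full - y_calibrated)
print(err_before, err_after)
```

On the calibration data the corrected error cannot exceed the uncorrected one, because a zero correction is always a candidate solution of the ridge objective. How well the correction transfers to unseen inputs is not something this sketch measures.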
Abstract
Most existing structured pruning methods for Large Language Models (LLMs) require substantial computational and data resources for retraining to reestablish the corrupted correlations, making them prohibitively expensive. To address this, we propose a pruning framework for LLMs called Orthogonal decomposition and Linear Calibration (Olica), which eliminates the need for retraining. A key observation is that the multi-head attention (MHA) layer depends on two types of matrix products. By treating these matrix products as unified entities and applying principal component analysis (PCA), we extract the most important information to compress LLMs without sacrificing accuracy or disrupting their original structure. Consequently, retraining becomes unnecessary. A fast decomposition method is devised, reducing the complexity of PCA by a factor of the square of the number of attention heads.…
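To make the idea of treating a matrix product as a single entity concrete, here is a minimal sketch under stated assumptions. It uses a truncated SVD of one head's query-key product as a stand-in for the PCA step; the paper's exact procedure and its fast decomposition are not reproduced. The function `compress_head_product`, the weight shapes, and the chosen rank are hypothetical.

```python
import numpy as np

def compress_head_product(w_q, w_k, rank):
    """Factor the product w_q @ w_k.T into two thinner matrices.

    Returns (w_q_new, w_k_new), each of shape (d_model, rank), such that
    w_q_new @ w_k_new.T is the best rank-`rank` approximation of the product
    in Frobenius norm (Eckart-Young).
    """
    product = w_q @ w_k.T                       # (d_model, d_model), rank <= d_head
    u, s, vt = np.linalg.svd(product, full_matrices=False)
    root = np.sqrt(s[:rank])                    # split singular values evenly
    w_q_new = u[:, :rank] * root                # scale each kept column
    w_k_new = vt[:rank].T * root
    return w_q_new, w_k_new

rng = np.random.default_rng(0)
d_model, d_head, rank = 64, 16, 8
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))

w_q_new, w_k_new = compress_head_product(w_q, w_k, rank)
full = w_q @ w_k.T
approx = w_q_new @ w_k_new.T
rel_err = np.linalg.norm(full - approx) / np.linalg.norm(full)
print(w_q_new.shape, w_k_new.shape, rel_err)
```

The new factors have the same layout as the originals with a smaller head dimension, so the layer keeps its structure. Random Gaussian weights have a flat spectrum and therefore compress poorly; the abstract's claim is that the products in trained LLMs can be compressed without sacrificing accuracy.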
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy
Methods: Softmax · Linear Layer · Attention Is All You Need · Principal Components Analysis · Multi-Head Attention · Pruning
