IntraSlice: Towards High-Performance Structural Pruning with Block-Intra PCA for LLMs
Meng Li, Peisong Wang, Yuantian Shao, Qinghao Hu, Hongjian Fang, Yifan Zhang, Zhihui Wei, Jian Cheng

TL;DR
IntraSlice is a block-wise, PCA-based structured pruning method for LLMs. It applies PCA inside Transformer modules, so the transformation matrices fuse fully into the existing weights without extra parameters, reducing model size and inference time with limited performance loss.
Contribution
The paper proposes an intra-module PCA compression technique whose transformation matrix can be fully fused into the model, together with a PCA-based global pruning ratio estimator, to improve structured pruning for LLMs.
Findings
Achieves higher compression with less performance loss.
Demonstrates superior results on Llama2, Llama3, and Phi models.
Outperforms recent baselines at equivalent compression ratios.
Abstract
Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but leads to significant performance degradation. Recent PCA-based pruning methods have alleviated this issue by retaining key activation components, but are only applied between modules in order to fuse the transformation matrix, which introduces extra parameters and severely disrupts activation distributions due to residual connections. To address these issues, we propose IntraSlice, a framework that applies block-wise module-intra PCA compression pruning. By leveraging the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without additional parameters. We also introduce a PCA-based global pruning ratio…
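To make the idea of a "fully fused" transformation matrix concrete, here is a minimal NumPy sketch of generic PCA-based width pruning between two linear maps. It is not the authors' implementation: the function name pca_fuse, the toy dimensions, and the assumption that no nonlinearity or residual connection sits between the two maps are all illustrative choices. The sketch projects the hidden activations onto their top-k principal directions Q and absorbs Q into the neighbouring weights, so no separate transformation matrix has to be stored.

```python
import numpy as np

def pca_fuse(w_in, w_out, acts, k):
    """Shrink the hidden width between two linear maps to k via PCA.

    w_in:  (d_in, d_hidden) weight of the first linear map
    w_out: (d_hidden, d_out) weight of the second linear map
    acts:  (n_samples, d_hidden) calibration activations, acts = x @ w_in
    (Hypothetical helper for illustration only.)
    """
    # Uncentered second-moment matrix of the hidden activations.
    cov = acts.T @ acts
    # eigh returns eigenvalues in ascending order; keep the top-k eigenvectors.
    _, eigvecs = np.linalg.eigh(cov)
    q = eigvecs[:, ::-1][:, :k]  # (d_hidden, k)
    # x @ w_in @ q @ q.T @ w_out: absorb q into both neighbouring weights.
    return w_in @ q, q.T @ w_out

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, rank, n = 16, 32, 16, 8, 256

# Toy setup (an assumption): the hidden activations are exactly rank-8,
# so keeping k = 8 principal directions loses nothing.
w_in = rng.normal(size=(d_in, rank)) @ rng.normal(size=(rank, d_hidden))
w_out = rng.normal(size=(d_hidden, d_out))
x = rng.normal(size=(n, d_in))

w_in_p, w_out_p = pca_fuse(w_in, w_out, x @ w_in, k=rank)

dense = x @ w_in @ w_out
pruned = x @ w_in_p @ w_out_p
rel_err = np.linalg.norm(dense - pruned) / np.linalg.norm(dense)

params_before = w_in.size + w_out.size
params_after = w_in_p.size + w_out_p.size
print(f"relative error: {rel_err:.2e}")
print(f"parameters: {params_before} -> {params_after}")
```

In this toy case the pruned pair reproduces the dense output up to floating-point error because the activations are low-rank by construction; real activations are only approximately low-rank, so some error remains. When the projection instead has to cross a residual connection, Q cannot be absorbed this cleanly, which is the source of the extra parameters the abstract attributes to inter-module approaches.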
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Multimodal Machine Learning Applications
