MGAA: Multi-Granular Adaptive Allocation for Low-Rank Compression of LLMs
Guangyan Li, Yongqiang Tang, Wensheng Zhang

TL;DR
This paper introduces MGAA, a novel adaptive parameter allocation method for low-rank compression of large language models, which improves efficiency by tailoring compression ratios based on sublayer importance and energy distribution.
Contribution
MGAA adaptively allocates compression parameters across and within sublayers without task-specific tuning, enhancing model compression effectiveness for LLMs.
Findings
MGAA outperforms existing uniform compression methods.
It achieves better energy retention and model performance.
Its effectiveness is demonstrated on multiple LLMs and multimodal models.
Abstract
The enormous parameter scale of large language models (LLMs) has made model compression a research hotspot, with the aim of alleviating computational resource demands during deployment and inference. As a promising direction, low-rank approximation has achieved remarkable results. However, the vast majority of studies on low-rank approximation compression apply uniform compression ratios across all weight matrices, disregarding their inherently different impacts on the model's performance. Although a few recent works attempt to employ heuristic search strategies to find the optimal parameter allocation, such strategies are computationally inefficient and lose generalization ability in the era of LLMs. In this study, we propose a novel parameter Multi-Granular Adaptive Allocation (MGAA) method, which can adaptively allocate…
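To make the contrast between uniform and energy-aware rank selection concrete, here is a minimal sketch in plain Python. It is not the MGAA algorithm, whose details are not given in the truncated abstract above. It assumes that "energy" means the sum of squared singular values of a weight matrix, and the function names (`retained_energy`, `rank_for_energy`, `low_rank_params`) and the toy spectra are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def retained_energy(singular_values: Sequence[float], rank: int) -> float:
    """Fraction of squared singular-value mass kept by the top `rank` components.

    Assumption: energy is the sum of squared singular values.
    """
    squares = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squares)
    if total == 0:
        return 1.0
    return sum(squares[:rank]) / total


def rank_for_energy(singular_values: Sequence[float], target: float) -> int:
    """Smallest rank whose retained energy reaches `target` (0 < target <= 1)."""
    squares = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squares)
    running = 0.0
    for k, sq in enumerate(squares, start=1):
        running += sq
        if running >= target * total:
            return k
    return len(squares)


def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Parameter count of a rank-`rank` factorization of a rows x cols matrix."""
    return rank * (rows + cols)


# Two toy singular-value spectra (hypothetical, not taken from the paper).
fast_decay = [10.0, 1.0, 0.5, 0.1]  # energy concentrated in one direction
flat = [3.0, 3.0, 3.0, 3.0]         # energy spread evenly

# A uniform rank of 2 keeps almost all of the first spectrum
# but only half of the second.
uniform_rank = 2
energy_fast = retained_energy(fast_decay, uniform_rank)
energy_flat = retained_energy(flat, uniform_rank)

# An energy target instead assigns a different rank to each matrix.
rank_fast = rank_for_energy(fast_decay, 0.9)
rank_flat = rank_for_energy(flat, 0.9)
```

With these toy spectra, the same rank budget treats the two matrices very differently, which is the kind of mismatch the abstract attributes to uniform compression ratios. How MGAA actually weighs sublayer importance against energy distribution is not specified in the text above.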
