Large Language Model Compression with Global Rank and Sparsity Optimization
Changhai Zhou, Qian Qiao, Yuhua Zhou, Yuxin Wu, Shichao Weng, Weizhong Zhang, Cheng Jin

TL;DR
This paper introduces a two-stage global resource allocation method for compressing large language models by combining low-rank and sparse approximations, addressing layer redundancy and interaction challenges.
Contribution
It proposes a novel two-stage approach that uses robust PCA (RPCA) and a probabilistic selection strategy for joint low-rank and sparse optimization with global resource management.
Findings
Significantly outperforms existing sparsification methods
Automatically detects layer redundancy
Effectively manages low-rank and sparse interactions
Abstract
Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of existing methods. The first challenge relates to the interaction and cooperation between low-rank and sparse matrices, while the second involves determining weight allocation across different layers, as redundancy varies considerably among them. To address these challenges, we propose a novel two-stage LLM compression method with the capability of global resource allocation for rank and sparsity. It is noteworthy that the overall optimization space is vast, making comprehensive optimization computationally prohibitive. Therefore, to reduce the optimization space, our first stage utilizes robust principal component analysis to decompose the weight matrices of LLMs into low-rank and sparse…
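The first-stage idea in the abstract, splitting a weight matrix W into a low-rank part L plus a sparse part S, can be sketched with a textbook robust PCA solver. This is a minimal illustration only: the abstract does not say which RPCA solver the authors use, so the augmented-Lagrangian iteration below, the function names (`rpca`, `svd_threshold`, `shrink`), and the default values of `lam` and `mu` (the common choices from the principal component pursuit literature) are all assumptions.

```python
import numpy as np

def shrink(X, tau):
    # Elementwise soft-thresholding: promotes sparsity in S.
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    # Soft-threshold the singular values: promotes low rank in L.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt

def rpca(W, lam=None, mu=None, tol=1e-7, max_iter=500):
    # Hypothetical helper: approximately solves
    #   min ||L||_* + lam * ||S||_1   subject to   L + S = W
    # with a fixed-penalty augmented-Lagrangian iteration.
    m, n = W.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))         # assumed default
    if mu is None:
        mu = m * n / (4.0 * np.abs(W).sum())   # assumed default
    L = np.zeros_like(W)
    S = np.zeros_like(W)
    Y = np.zeros_like(W)                       # dual variable
    norm_W = np.linalg.norm(W)
    for _ in range(max_iter):
        L = svd_threshold(W - S + Y / mu, 1.0 / mu)
        S = shrink(W - L + Y / mu, lam / mu)
        R = W - L - S                          # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_W:
            break
    return L, S
```

Calling `L, S = rpca(W)` on a single layer's weight matrix yields candidate low-rank and sparse components. How those candidates are then selected under a global budget is the job of the second stage, which this sketch does not cover.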
Peer Reviews
Decision: ICLR 2026 Poster
1. The writing is clear and easy to follow. The appendix is a good preliminary for relevant techniques.
2. This work provides enough details for reproducibility.
3. Extensive ablation and analysis, including in the main content and Appendix K, show meaningful insights.
1. While the paper reports zero-shot accuracy on eight benchmarks and perplexity on WikiText-2, it would be valuable to include tasks that require chain-of-thought or longer generations, such as those with more than 100 tokens. Prior work has observed that models can retain perplexity and short-form QA accuracy, but degrade more sharply as generation lengths increase.
- The paper presents a conceptually unified view of low-rank and sparse compression and emphasizes global allocation of redundancy across layers.
- The Bernoulli-policy optimization introduces a probabilistic pruning mechanism that is simple and can be trained without full fine-tuning.
- The proposed CAP framework largely combines existing elements (RPCA-based decomposition, SVD thresholding, and REINFORCE optimization) without a fundamentally new algorithmic insight. The combination is technically straightforward and primarily an engineering integration of known methods.
- Reported “no fine-tuning” performance comes at the cost of multiple RPCA and policy-gradient passes on calibration data. The wall-clock savings over simpler pruning or quantization approaches are not measu
1. Principled search-space reduction: RPCA to form high-quality candidate subspaces before budgeted selection is elegant and well motivated.
2. Global, budget-aware allocation: Bernoulli masking with policy gradients ties rank and sparsity to a single parameter budget K, avoiding heuristic thresholds and per-layer guessing.
3. Consistent empirical gains: Tables show competitive or superior performance to strong sparsifiers (SparseGPT/Wanda/OATS) at 30–50% compression.
4. Reproducibility: Impl
1. Missing comparison with LoSparse: The paper explicitly points out that LoSparse suffers from manually selected ranks and lack of global coordination, yet the main experimental tables do not include LoSparse results.
2. Lack of comparison with recent low-rank methods: To validate the claimed advantage of RPCA-based decomposition, the paper should include results on SVD-LLM v2 (https://arxiv.org/abs/2503.12340), Basis Sharing (https://openreview.net/pdf?id=gp32jvUquq), and Dobi-SVD (https://openr
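The reviews above describe the second stage as Bernoulli masking trained with policy gradients (REINFORCE) under a single parameter budget K. A toy sketch of that mechanism follows. It is not the authors' objective: the per-candidate `importance` scores stand in for a loss that the paper evaluates on calibration data, and the names `reinforce_masks`, `penalty`, `lr`, and `samples`, the hinge-style budget penalty, and the mean baseline are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reinforce_masks(importance, cost, budget, steps=500, samples=64,
                    lr=0.3, penalty=4.0, seed=0):
    # Hypothetical toy: learn keep-probabilities p_i = sigmoid(theta_i)
    # for candidate components (e.g. singular directions or sparse
    # entries), each with a parameter cost, under a total budget K.
    importance = np.asarray(importance, dtype=float)
    cost = np.asarray(cost, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros_like(importance)
    for _ in range(steps):
        p = sigmoid(theta)
        # Sample a batch of binary masks m ~ Bernoulli(p).
        m = (rng.random((samples, p.size)) < p).astype(float)
        # Toy loss: importance of dropped components plus a
        # penalty for exceeding the budget (both are assumptions).
        dropped = ((1.0 - m) * importance).sum(axis=1)
        over = np.maximum(m @ cost - budget, 0.0)
        loss = dropped + penalty * over
        # REINFORCE with a mean baseline; for a sigmoid-parameterized
        # Bernoulli, d/dtheta log P(m) = m - p.
        adv = loss - loss.mean()
        grad = (adv[:, None] * (m - p)).mean(axis=0)
        theta -= lr * grad
    return sigmoid(theta)

keep_prob = reinforce_masks(
    importance=[6.0, 5.0, 4.0, 0.1, 0.1, 0.1],
    cost=[1.0] * 6,
    budget=3.0,
)
```

In this toy run the learned probabilities should concentrate on the three high-importance candidates, which together exactly fill the budget. Because the masks are sampled rather than relaxed, the loss only needs to be evaluated, not differentiated, which is consistent with the reviewers' remark that the policy can be trained without full fine-tuning.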
Taxonomy
Topics: Topic Modeling · Stochastic Gradient Optimization Techniques · Big Data and Digital Economy
