ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression
Lin Xv, Jingsheng Gao, Xian Gao, Ting Liu, Yuzhuo Fu

TL;DR
This paper introduces ARA, a novel method for adaptive rank allocation in SVD-based large language model compression, improving performance and efficiency by addressing limitations of previous approaches.
Contribution
ARA provides a new mask design and loss function to optimize rank allocation, achieving state-of-the-art results in LLM compression.
Findings
Reduces perplexity on WikiText2 from 8.38 to 6.42 at 80% compression
Improves zero-shot task accuracy by 9.72 percentage points
Outperforms existing heuristic and mask-based methods
Abstract
In the field of large language model (LLM) compression, singular value decomposition (SVD) is a widely studied and adopted low-rank decomposition technique. Since SVD operates exclusively on linear modules, and these modules in LLMs are separated by nonlinear components, SVD can only be applied independently to each linear module. Under a global compression ratio constraint, determining the appropriate rank for different linear modules becomes a critical problem. Existing approaches, such as heuristic algorithms and mask-based training, have made progress in addressing this challenge. However, these methods still suffer from several limitations: heuristic algorithms explore the solution space within restricted regions, while mask-based training struggles to efficiently capture the relationship between singular value spectra and trainable parameters. More importantly, current methods…
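To make the setting in the abstract concrete, here is a minimal sketch, assuming NumPy and a single dense linear weight of shape m × n. The function names (`truncated_svd_factors`, `param_ratio`, `break_even_rank`) are hypothetical and do not come from the paper. The sketch only illustrates plain rank-k truncation and the parameter count that a global compression-ratio constraint would act on, not ARA itself.

```python
import numpy as np

def truncated_svd_factors(W, k):
    """Return A (m x k) and B (k x n) such that A @ B is the best
    rank-k approximation of W in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # scale the first k left singular vectors
    B = Vt[:k, :]          # first k right singular vectors
    return A, B

def param_ratio(m, n, k):
    """Parameters of the two factors relative to the dense m x n weight."""
    return k * (m + n) / (m * n)

def break_even_rank(m, n):
    """Largest rank k with k * (m + n) < m * n, i.e. the largest rank
    at which the factorized form still stores fewer parameters."""
    return (m * n - 1) // (m + n)

# Illustrative usage on a random matrix (not a real LLM weight).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
A, B = truncated_svd_factors(W, k=8)
print(A.shape, B.shape, param_ratio(64, 32, 8), break_even_rank(64, 32))
```

Under these assumptions, rank allocation amounts to choosing one k per linear module so that the summed factor parameters meet the global budget. A module whose chosen rank exceeds `break_even_rank` would be cheaper to keep dense.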
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* The rank-selection problem itself is important and worth studying.
* The scenario discussed in Section 3.3, where compression offers no gain and the method should fall back to the full-rank weight, is an important but often under-discussed topic.
* The paper also shows that the proposed approach is orthogonal to other techniques such as quantization.
* The method section is not well presented. I suggest improving the clarity of Section 3.2 and either bringing Algorithm 1 into the main text or explicitly referencing it there.
* The paper repeatedly says that prior mask methods get stuck in local minima and that ARA “reaches the global optimum” (Fig. 2), but what the authors actually do is add a simple guidance loss term, which does not amount to a general optimality result. This claim should be toned down.
* There is no ef
- A simple, monotone mask construction that avoids vanishing gradients and enforces a preference for larger singular values; the STE makes training match inference.
- Competitive numbers across multiple models and tasks, with tables showing consistent improvements over prior SVD rank-allocation methods.
- Incremental conceptual novelty. The core ideas (monotone masks over singular spectra, and an auxiliary term that favors keeping full rank when low-rank is inefficient) extend well-known SVD truncation and mask-learning practices. The “guidance loss” formalizes a heuristic already implicit in prior pipelines that skip SVD when k(m+n)>mn. The theoretical treatment does not go beyond standard truncation analyses.
- Claims target “efficient LLMs,” yet there is no end-to-end latency/throughput
- Converts the discrete rank-selection problem into a continuous, simplex-constrained optimisation that is easy to train with standard back-propagation.
- The proposed staircase-mapping mask ensures monotonicity with respect to singular values, preserving the theoretical optimality of the Eckart–Young theorem and avoiding the instability of Gumbel-Sigmoid or tanh masks.
- The framework remains orthogonal to pruning and quantization, and can be combined with both for further efficiency gains.
- The monotonic mask enforces that singular values with larger magnitudes are always prioritized. As a result, ARA cannot invert or locally re-weight the importance of individual singular directions; it can only adjust the global cutoff boundary. This restricts its ability to capture task-critical, low-energy directions that matter more to downstream performance than to reconstruction error.
- The optimization objective remains dominated by spectral energy preservation and cross-entropy loss. It does no
1. Strong Empirical Result.
2. Novel Treatment of R≥1 Case: The guidance loss Lg and dynamic computational flow (Equation 8) address a genuine problem that prior mask-based methods overlook. This is a valuable contribution.
3. Improved Mask Design: The staircase binary matrix parameterization ensures monotonicity while avoiding vanishing gradients (unlike tanh-based masks) and maintaining a global receptive field (unlike Gumbel-Sigmoid masks). The technical design is sound.
1. Missing Baselines and Citations. The paper should cite the following two papers in both the literature review and the experimental comparisons: 1) TFWSVD (EMNLP 2022), "Numerical Optimizations for Weighted Low-rank Estimation on Language Model"; 2) RankDyna (EMNLP 2023 Findings), "Dynamic Low-rank Estimation for Transformer-based Language Models". TFWSVD is the follow-up to FWSVD and is more accurate. More importantly, that paper proposes a Fisher information variance metric φ(W) that predicts w
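The reviews describe a mask that is monotone over the singular spectrum, built from a simplex-constrained parameter and a "staircase binary matrix". One way such a construction could look is sketched below. This is an assumption for illustration only: the function name `staircase_mask`, the softmax parameterization, and the upper-triangular matrix are guesses, not the paper's actual equations, and the straight-through estimator mentioned by the reviewers is omitted.

```python
import numpy as np

def staircase_mask(logits):
    """Map r free logits to a non-increasing soft mask over r singular values.

    Hypothetical construction: softmax puts p on the probability simplex,
    and an upper-triangular 0/1 "staircase" matrix accumulates it so that
    mask[i] = sum_{j >= i} p[j]. Hence mask[0] = 1 and mask never increases.
    """
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max())
    p = p / p.sum()                    # point on the simplex
    r = p.shape[0]
    S = np.triu(np.ones((r, r)))       # S[i, j] = 1 if j >= i, else 0
    return S @ p

# Illustrative usage: soft mask, a hard 0/1 mask by thresholding, and the soft rank.
mask = staircase_mask([0.0, 2.0, -1.0, 0.5])
hard = (mask > 0.5).astype(int)
print(mask, hard, mask.sum())
```

In this sketch every logit influences every mask entry through the softmax, and the mask can only move the cutoff along the spectrum rather than reorder singular values, which matches the limitation one reviewer raises about monotone masks.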
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
