CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

TL;DR
This paper introduces CARE, a novel covariance-aware and rank-enhanced method for converting pretrained attention modules into multi-head latent attention, significantly improving efficiency and fidelity in large language models.
Contribution
CARE presents a new conversion pipeline that aligns approximations with input activations, optimally allocates ranks across layers, and reparameterizes keys and values, outperforming existing methods.
Findings
Reduces perplexity by up to 215x on large models
Improves mean accuracy by up to 1.70x at fixed KV budgets
Fully recovers original model accuracy after brief fine-tuning
Abstract
Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input…
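The contrast the abstract draws between weight-only and activation-aware low-rank approximation can be illustrated with a small NumPy sketch. It is not CARE's implementation: the function names, the synthetic activations `X`, the ridge term `eps`, and the use of the square-root weighting C^(1/2) (which minimizes the output error ||X(W - W_r)||_F directly) are all assumptions made for illustration. The reviews on this page describe the paper itself as using SVD(CW) followed by C^(-1).

```python
import numpy as np

def weight_only_lowrank(W, r):
    """Rank-r truncated SVD of W; ignores the inputs W is applied to."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def activation_aware_lowrank(W, X, r, eps=1e-6):
    """Rank-r approximation of W that targets the output error ||X (W - W_r)||_F.

    Assumption: C is the uncentered second moment of X plus a small ridge
    (eps) so that C^(1/2) is invertible. Hypothetical helper, not from the paper.
    """
    d = W.shape[0]
    C = X.T @ X / X.shape[0] + eps * np.eye(d)
    lam, Q = np.linalg.eigh(C)              # C = Q diag(lam) Q^T
    S = (Q * np.sqrt(lam)) @ Q.T            # C^(1/2)
    S_inv = (Q / np.sqrt(lam)) @ Q.T        # C^(-1/2)
    U, s, Vt = np.linalg.svd(S @ W, full_matrices=False)
    return S_inv @ ((U[:, :r] * s[:r]) @ Vt[:r])

# Synthetic, strongly anisotropic activations (illustrative data only).
rng = np.random.default_rng(0)
n, d, m, r = 2048, 64, 48, 8
scales = np.logspace(0, -2, d)
mix = rng.standard_normal((d, d))
X = (rng.standard_normal((n, d)) * scales) @ mix
W = rng.standard_normal((d, m))

def output_error(W_hat):
    """Relative error measured on the layer outputs X @ W."""
    return np.linalg.norm(X @ (W - W_hat)) / np.linalg.norm(X @ W)

err_weight_only = output_error(weight_only_lowrank(W, r))
err_activation_aware = output_error(activation_aware_lowrank(W, X, r))
print(f"weight-only SVD:      {err_weight_only:.4f}")
print(f"activation-aware SVD: {err_activation_aware:.4f}")
```

At equal rank, the weight-only factorization is the best approximation of W itself, while the activation-aware one trades some weight-space accuracy for lower error on the outputs the layer actually produces.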
Peer Reviews
Decision·ICLR 2026 Poster
- The problem is realistic and currently unsolved: many MHA/GQA models are already deployed, and we want the advantages of MLA (smaller KV cache, more efficient decoding) without retraining from scratch.
- The paper does not merely claim that “data-aware compression is better”; it derives the form from the activation-space objective, which is more solid.
- The global, non-uniform rank allocation is motivated by data (different layers have different sensitivity) and the solution is standard and optim
1. The robustness of the covariance estimation is not fully demonstrated. All calibration corpora are relatively similar, so it is unclear whether the same rank allocation remains good for code or instruction-heavy data.
2. The method could be compared more directly against other data-/curvature-aware compression methods in at least one setting.
1. While covariance-weighted factorization exists in prior work (FWSVD, SVD-LLM), the specific formulation of SVD(CW) followed by C^(-1) unwhitening for MLA conversion is fairly novel.
2. Adaptive rank allocation under a fixed budget: the water-filling algorithm for distributing rank across layers based on weighted singular spectra is a good solution, with the empirical observation (Figure 2) that layers have heterogeneous sensitivity to rank reduction motivating this approach. The paper also doe
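To make the idea of non-uniform rank allocation under a fixed budget concrete, here is a minimal greedy sketch. It is an assumption-laden illustration, not the paper's water-filling algorithm: `allocate_ranks`, `residual_energy`, the `min_rank` floor, and the toy spectra are all hypothetical. It keeps the globally largest squared singular values across layers, which minimizes the total discarded spectral energy for a given total rank.

```python
import numpy as np

def allocate_ranks(spectra, total_rank, min_rank=1):
    """Greedily split `total_rank` across layers from their singular spectra.

    Every layer first gets `min_rank`; the remaining budget goes to the
    largest remaining squared singular values, wherever they occur.
    """
    spectra = [np.sort(np.asarray(s, dtype=float))[::-1] for s in spectra]
    ranks = np.full(len(spectra), min_rank, dtype=int)
    remaining = total_rank - ranks.sum()
    if remaining < 0:
        raise ValueError("total_rank is smaller than min_rank * number of layers")
    # Candidate gains: energy recovered by each additional rank, tagged by layer.
    gains = np.concatenate([s[min_rank:] ** 2 for s in spectra])
    owners = np.concatenate(
        [np.full(len(s) - min_rank, i) for i, s in enumerate(spectra)]
    )
    top = np.argsort(-gains, kind="stable")[:remaining]
    ranks += np.bincount(owners[top], minlength=len(spectra))
    return ranks

def residual_energy(spectra, ranks):
    """Total squared singular value mass discarded under a rank assignment."""
    return sum(
        float(np.sum(np.sort(np.asarray(s, dtype=float))[::-1][k:] ** 2))
        for s, k in zip(spectra, ranks)
    )

# Toy spectra for three layers with different decay profiles (made-up numbers).
spectra = [
    [10.0, 5.0, 1.0, 0.5],   # fast decay after two directions
    [3.0, 2.5, 2.0, 1.5],    # flat spectrum, needs more rank
    [8.0, 0.2, 0.1, 0.05],   # nearly rank one
]
ranks = allocate_ranks(spectra, total_rank=6)
uniform = [2, 2, 2]
print(ranks, residual_energy(spectra, ranks), residual_energy(spectra, uniform))
```

In this toy case the flat-spectrum layer receives the extra rank that the nearly rank-one layer does not need, so the discarded energy is lower than under the uniform split with the same total budget.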
1. The paper's central claim uses ||C(W - Wc)||_F^2 as a proxy for ||√C(W - Wc)||_F^2 (page 5, Section 3.4), justified by a brief eigenspace argument that both are left-multiplied by the same eigenspaces of C with different weightings (λᵢ² vs λᵢ). This essentially squares the importance weights, which could over-emphasize dominant directions and under-represent moderate-variance directions that still matter for downstream tasks. The paper's claim that this "tends to preserve ordering of dominant co
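The difference in weighting that this comment refers to can be written out explicitly. The following is a short derivation under the assumption that C is symmetric positive semidefinite with eigendecomposition C = U Λ Uᵀ, writing E = W - Wc; the symbols U, Λ, u_i, and E are introduced here for illustration.

```latex
% Assumption: C = U \Lambda U^\top is symmetric PSD with eigenpairs (\lambda_i, u_i),
% and E = W - W_c is the approximation error.
\begin{align*}
\lVert C E \rVert_F^2
  &= \operatorname{tr}\!\left(E^\top C^2 E\right)
   = \sum_i \lambda_i^2 \, \lVert u_i^\top E \rVert_2^2, \\
\lVert C^{1/2} E \rVert_F^2
  &= \operatorname{tr}\!\left(E^\top C E\right)
   = \sum_i \lambda_i \, \lVert u_i^\top E \rVert_2^2 .
\end{align*}
```

Both objectives penalize the error along the same eigenvectors u_i, but the first weights direction i by λᵢ² and the second by λᵢ. If C is taken to be the uncentered second moment of the activations, C = XᵀX / n (an assumption here), the second quantity equals ||XE||_F^2 / n, the error measured on the layer outputs.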
- The proposed method is a clear improvement over prior work, and the insight of performing SVD over activations is a good one.
- The paper is well written and easy to understand.
- I really like Observations 1 and 2; they make for an interesting read.
- Line 068-069: I disagree that this is a whitening operation. Whitening is a very well-defined operation: if X is a centered matrix of shape (B, D), where B is the batch size and D is the dimension, then whitening is X * (1/(B-1) * X^T X)^{-1/2}. I therefore suggest avoiding the term "whitening", which has a mathematically precise definition.
- In general, the text in the figures is too small and impossible to read if you print out the paper.
- Line 205-211: I don
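The definition of whitening given in this comment can be checked numerically. The sketch below follows that formula, X * (1/(B-1) * X^T X)^{-1/2} on centered X; the function name `whiten` and the synthetic data are assumptions for illustration, and the inverse square root is computed from an eigendecomposition, which requires the sample covariance to be full rank.

```python
import numpy as np

def whiten(X):
    """Whiten rows of X: center, then right-multiply by cov^(-1/2).

    Assumes the sample covariance is full rank (B > D and no exactly
    collinear features); otherwise the inverse square root does not exist.
    """
    Xc = X - X.mean(axis=0, keepdims=True)       # center each feature
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)          # (D, D) sample covariance
    lam, Q = np.linalg.eigh(cov)                 # cov = Q diag(lam) Q^T
    inv_sqrt = (Q / np.sqrt(lam)) @ Q.T          # cov^(-1/2)
    return Xc @ inv_sqrt

# Correlated synthetic data of shape (B, D) = (4096, 8).
rng = np.random.default_rng(0)
X = rng.standard_normal((4096, 8)) @ rng.standard_normal((8, 8))
Z = whiten(X)
cov_Z = Z.T @ Z / (Z.shape[0] - 1)
print(np.round(cov_Z, 3))
```

After whitening, the sample covariance of Z is the identity up to floating-point error, which is the property that distinguishes whitening from a generic covariance-weighted transform.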
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Machine Learning in Healthcare
