Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization

Boya Xiong; Shuo Wang; Weifeng Ge; Guanhua Chen; Yun Chen

arXiv:2506.11087·cs.LG·February 17, 2026

TL;DR

PrinMix introduces a mathematically grounded, SVD-based quantization framework for delta compression in LLMs, optimizing error minimization and outperforming state-of-the-art methods on large models.

Contribution

It models quantization as an optimization problem, derives a key scaling mechanism, and employs ILP for optimal bit allocation, advancing delta compression techniques.

Findings

01. Outperforms SOTA Delta-CoMe by 22.3% on AIME2024

02. Achieves 6.1% improvement on GQA benchmark

03. Effectively reduces storage for 7B LLMs

Abstract

Supervised Fine-Tuning (SFT) empowers Large Language Models (LLMs) with exceptional performance on specialized tasks, but it yields dense, high-dimensional delta parameters that pose severe storage and distribution challenges. Singular Value Decomposition (SVD)-based compression offers a compact representation for such delta parameters, but existing methods adopt heuristic quantization without clarifying underlying mechanisms, leading to poor generalizability. In this work, we propose PrinMix, a rigorous SVD-based framework that models quantization as an optimization problem, grounding the design in mathematical mechanisms. We first theoretically derive quantization error and identify a key singular-value-dominated scaling mechanism, which mathematically proves the necessity of mix-precision quantization. We then model the quantization scheme as a 0/1 Integer Linear Programming (ILP)…
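The singular-value-dominated scaling mentioned in the abstract can be illustrated with a short derivation. This is a simplified sketch and not necessarily the paper's own derivation: it assumes a thin SVD of the delta weights, $\Delta W = U \Sigma V^{\top}$, assumes that only $V^{\top}$ is quantized while $U$ and $\Sigma$ stay exact, and introduces $E$ as a hypothetical symbol for the quantization error on $V^{\top}$.

```latex
\begin{aligned}
\Delta W &= U \Sigma V^{\top}, \qquad \widehat{V}^{\top} = V^{\top} + E, \\
\bigl\| \Delta W - U \Sigma \widehat{V}^{\top} \bigr\|_F^2
  &= \| U \Sigma E \|_F^2
   = \operatorname{tr}\!\bigl( E^{\top} \Sigma \, U^{\top} U \, \Sigma E \bigr)
   = \| \Sigma E \|_F^2 \\
  &= \sum_{i} \Sigma_{ii}^{2} \, \| E_{i,:} \|_2^2 .
\end{aligned}
```

Under these assumptions, the same amount of quantization noise in row $i$ of $V^{\top}$ is weighted by $\Sigma_{ii}^{2}$ in the reconstruction error, which is one way to see why a single uniform bitwidth across all rows may be suboptimal.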

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- Addresses a practical problem in model distribution and storage: delta checkpoint compression for multi-task or multi-domain fine-tuned models.

Weaknesses

- The core idea, combining low-rank and quantized residual compression, is well explored in prior works such as QLoRA, AdaLoRA, and CompAdapter. The proposed “layer-wise scaling reweighting” is a small variant of norm-based importance metrics used in parameter-efficient tuning.
- The method is entirely empirical. The paper lacks mathematical justification or analysis on how the scaling or residual quantization improves representational fidelity beyond heuristic intuition.
- Experiments are rest

Reviewer 02 · Rating 4 · Confidence 4

Strengths

* Principled objective: Explicitly minimizes a reconstruction-error surrogate in SVD space, yielding a clear justification for row-wise mixed precision of $V$ under a bit budget. The $\Sigma_{ii}^2$ scaling vs. difference decomposition is intuitive and actionable.
* Concrete optimization: Bit allocation via 0/1 ILP provides a crisp mechanism to trade off error and storage, with constraints for budget and a cap $f_{\max}$ on distinct bitwidths.
* RTC mechanism: The Reconstruction Target Cor
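The bit-allocation idea in this review (one bitwidth per row, a storage budget, and a cap $f_{\max}$ on distinct bitwidths) can be sketched as a tiny exact solver. This is a toy illustration under stated assumptions, not the paper's ILP: the function name `allocate_bits`, the error proxy $\sigma_i^2 \cdot 4^{-b}$, and the cost model (each row costs `b` budget units) are all hypothetical choices made for this example. It enumerates allowed bitwidth sets and runs a small dynamic program, which is only practical for a handful of rows; a real instance would presumably be handed to an ILP solver.

```python
from itertools import combinations


def allocate_bits(sigmas, bitwidths, budget, f_max):
    """Exactly solve a tiny 0/1 bit-allocation problem.

    Each row i picks exactly one bitwidth b. The total of chosen bitwidths
    must not exceed `budget`, and at most `f_max` distinct bitwidths are used.
    Returns (total_error, bits_per_row), or None if nothing fits the budget.
    """

    def err(sigma, b):
        # Hypothetical error proxy (an assumption, not from the paper):
        # noise shrinks like 4^(-b) and is weighted by sigma^2.
        return sigma ** 2 * 4.0 ** (-b)

    best = None
    for k in range(1, f_max + 1):
        for allowed in combinations(bitwidths, k):
            # dp maps "budget used so far" -> (lowest error, assignment).
            dp = {0: (0.0, [])}
            for sigma in sigmas:
                nxt = {}
                for used, (e, assign) in dp.items():
                    for b in allowed:
                        u = used + b
                        if u > budget:
                            continue
                        cand = (e + err(sigma, b), assign + [b])
                        if u not in nxt or cand[0] < nxt[u][0]:
                            nxt[u] = cand
                dp = nxt
            for e, assign in dp.values():
                if best is None or e < best[0]:
                    best = (e, assign)
    return best


# Four rows with decaying singular values, an average of 4 bits per row.
total_error, bits = allocate_bits(
    [10.0, 5.0, 1.0, 0.1], (2, 3, 4, 8), budget=16, f_max=3
)
print(bits)  # [8, 4, 2, 2]
```

In this toy instance the mixed assignment beats the uniform 4-bit assignment that uses the same budget, and with `f_max=1` the solver falls back to `[4, 4, 4, 4]`.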

Weaknesses

1. Inconsistency with “no singular-value assumptions.” The method claims to avoid empirical reliance on singular values, yet Section D.1 discards the last $k$ ranks by singular-value magnitude to accelerate quantization, explicitly invoking the “larger singular values are more important” heuristic that the paper earlier critiques. This weakens the methodological positioning and may bias comparisons.
2. Fair-budget accounting is under-specified. Results are reported at $\alpha = 1/16$, but the

Reviewer 03 · Rating 4 · Confidence 4

Strengths

**1. Strong theoretical foundation.** The work formalizes SVD-based delta-compression as an explicit quantization-error-minimization problem and proves the necessity of mixed-precision allocation, advancing the theoretical rigor of delta-compression research.
**2. Comprehensive empirical validation.** Evaluations on 7B and 14B LLMs across four domains (reasoning, math, code, vision-language) show clear and reproducible gains over Delta-CoMe, BitDelta, and low-rank baselines.
**3. Practical dep

Weaknesses

**1. Limited scalability analysis.** While integer-linear optimization is solved once per model, reported solving times (≈ 30 min for 7B) may become impractical for larger or frequent model updates. Discussion on scaling to 70B+ models is missing.
**2. Ablation study.** Although four task types are covered, the paper lacks ablation on calibration-set size, bit-budget sensitivity, or robustness under distribution shift, which are important for real-world deployment.
**3. Computational overhead

Taxonomy

Topics: Advanced Data Compression Techniques · Algorithms and Data Compression