TL;DR
The paper introduces IO-SVD (Input-Output Whitened SVD), a post-training compression technique for large language models that combines input-output whitening with loss-aware rank allocation to reduce model size with minimal performance loss.
Contribution
It proposes a KL-aware double-sided whitening space and an efficient heterogeneous rank-allocation strategy for improved LLM compression.
Findings
IO-SVD incurs only minimal performance degradation after compression.
The method provides practical inference speedups across diverse models.
Loss-aware remapping enhances hybrid SVD-quantization compression.
Abstract
Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity,…
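The idea of a double-sided whitening space can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes a weight matrix `W` of shape `(d_out, d_in)`, an input Gram matrix `G_in` built from calibration activations, and a symmetric positive-definite output-side metric `M_out`. In IO-SVD that output metric comes from a second-order expansion of the KL loss; here it is only a generic stand-in, and the names `whitened_low_rank`, `weighted_error`, and `eps` are hypothetical. The sketch whitens `W` on both sides with Cholesky factors, truncates the SVD in the whitened space, and maps the factors back.

```python
import numpy as np

def whitened_low_rank(W, G_in, M_out, rank, eps=1e-6):
    """Rank-`rank` factors (B, C) with B @ C approximating W.

    Minimizes trace(D^T M_out D G_in) for D = W - B @ C, up to the small
    ridge `eps` that keeps both Cholesky factorizations well defined.
    """
    d_out, d_in = W.shape
    # Lower-triangular factors: G_in ~ S_in S_in^T, M_out ~ S_out S_out^T.
    S_in = np.linalg.cholesky(G_in + eps * np.eye(d_in))
    S_out = np.linalg.cholesky(M_out + eps * np.eye(d_out))

    # Whitened weight: the weighted error becomes a plain Frobenius norm here.
    A = S_out.T @ W @ S_in
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :rank], s[:rank], Vt[:rank]

    # Undo the whitening: B = S_out^{-T} U_r diag(s_r), C = V_r^T S_in^{-1}.
    B = np.linalg.solve(S_out.T, U_r * s_r)
    C = np.linalg.solve(S_in.T, Vt_r.T).T
    return B, C

def weighted_error(W, W_hat, G_in, M_out):
    """Input-output weighted squared error between W and W_hat."""
    D = W - W_hat
    return float(np.trace(D.T @ M_out @ D @ G_in))

# Toy usage with random stand-ins for the calibration statistics.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
X = rng.standard_normal((8, 50))   # stand-in input activations
Y = rng.standard_normal((6, 50))   # stand-in output sensitivities
G_in, M_out = X @ X.T, Y @ Y.T
B, C = whitened_low_rank(W, G_in, M_out, rank=3)
print(B.shape, C.shape, weighted_error(W, B @ C, G_in, M_out))
```

Storing `B` and `C` instead of `W` takes `rank * (d_out + d_in)` parameters rather than `d_out * d_in`, so the factorization only saves memory when `rank` is below `d_out * d_in / (d_out + d_in)`. Under the weighted error above, this factorization should do no worse than a plain truncated SVD of `W` at the same rank.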
