OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
Stephen Zhang, Vardan Papyan

TL;DR
OATS is a novel pruning method for large transformers that decomposes weights into sparse and low-rank components using second moment information, achieving high compression with minimal performance loss without retraining.
Contribution
Introduces OATS, a new outlier-aware pruning technique leveraging second moment data for effective, retraining-free compression of large models.
Findings
Achieves up to 60% model compression without retraining.
Delivers up to 1.37x CPU acceleration.
Maintains state-of-the-art performance on large language and vision models.
Abstract
The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while it has found great success in practice, has also been plagued by prohibitively expensive costs in terms of memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that utilizes the second moment information in the input embeddings to decompose the model weights into a sum of sparse and low-rank matrices. Without any retraining, OATS achieves state-of-the-art performance when compressing models by up to on large…
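The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the second moment information enters as a per-input-feature scaling of the weight columns, and that the sparse and low-rank parts are found by alternating a truncated SVD with magnitude-based hard thresholding. The function names (`second_moment_scale`, `sparse_plus_low_rank`) and the parameters `rank`, `nnz`, and `iters` are hypothetical.

```python
import numpy as np


def second_moment_scale(X, eps=1e-8):
    """Per-feature root second moment of calibration activations X (n_samples, d_in).

    Assumption: this diagonal scaling stands in for the "second moment
    information" mentioned in the abstract.
    """
    return np.sqrt(np.mean(X ** 2, axis=0)) + eps


def truncated_svd(A, rank):
    """Best rank-`rank` approximation of A in Frobenius norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]


def hard_threshold(A, nnz):
    """Keep the `nnz` largest-magnitude entries of A and zero out the rest."""
    if nnz <= 0:
        return np.zeros_like(A)
    if nnz >= A.size:
        return A.copy()
    flat = A.ravel()
    keep = np.argpartition(np.abs(flat), -nnz)[-nnz:]
    out = np.zeros(A.size, dtype=A.dtype)
    out[keep] = flat[keep]
    return out.reshape(A.shape)


def sparse_plus_low_rank(W, X, rank, nnz, iters=20):
    """Approximate W (d_out, d_in) as S + L with S sparse and L low rank.

    Hypothetical sketch: scale the columns of W by the activation statistics,
    alternate between the two projections, then undo the scaling.
    """
    d = second_moment_scale(X)
    Wd = W * d                      # emphasise columns fed by large activations
    S = np.zeros_like(Wd)
    for _ in range(iters):
        L = truncated_svd(Wd - S, rank)   # low-rank fit to what S leaves over
        S = hard_threshold(Wd - L, nnz)   # sparse fit to what L leaves over
    return S / d, L / d             # column rescaling keeps sparsity and rank
```

Calling `sparse_plus_low_rank(W, X, rank=2, nnz=48)` on a small random `W` and calibration batch `X` returns two matrices of the same shape as `W`. In this sketch each alternating step cannot increase the scaled Frobenius error, which is why the loop is written as two successive projections.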
Peer Reviews
Decision: ICLR 2025 Poster
- The concept of compressing a model as a sum of a sparse and low-rank matrix is very promising. Unlike most prior methods, which focus on one approach, OATS leverages both to potentially enhance performance.
- OATS is retraining-free, which is crucial for practical applications where even a single backpropagation pass can be computationally prohibitive.
- The framework has been tested on state-of-the-art models like Llama and ViT, demonstrating competitive performance.
- The Alternating Thre
One concern is that the method relies on multiple calls to truncated SVD, which can be computationally intensive. Specifically, finding the top-$r$ singular values of an $m \times n$ matrix has a time complexity of $O(mnr)$. Given that compression speed is a significant factor for practical applications, it would be helpful if the authors could clarify the time complexity and wall-clock time spent on the compression process of the overall algorithm. This would offer a more concrete understanding
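To make the reviewer's $O(mnr)$ figure concrete, here is a hedged sketch of a randomized truncated SVD, a standard technique whose dominant cost is two matrix products of size $m \times n$ by $n \times k$ with $k \approx r$. Whether OATS uses this routine or a different solver is not stated in the excerpt above. The function name `randomized_truncated_svd` and the `oversample` parameter are illustrative assumptions.

```python
import numpy as np


def randomized_truncated_svd(A, rank, oversample=8, seed=0):
    """Approximate top-`rank` SVD of A (m, n) via a random range finder.

    Illustrative only: the two products involving A cost O(m * n * k) with
    k = rank + oversample, which is where the O(mnr) scaling comes from.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)
    Omega = rng.standard_normal((n, k))   # random test matrix
    Y = A @ Omega                         # O(mnk): sample the column space of A
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the sample, (m, k)
    B = Q.T @ A                           # O(mnk): project A onto that basis, (k, n)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # small SVD of a k x n matrix
    U = Q @ Ub
    return U[:, :rank], s[:rank], Vt[:rank]
```

Timing this function inside a loop over all weight matrices and all iterations of a compression run would give the kind of wall-clock estimate the reviewer asks for. For a matrix of exact rank `rank`, the reconstruction `(U * s) @ Vt` matches the input up to floating-point error.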
+ Low-rankness plus sparsity is a good fit for compressing LLMs without retraining.
+ By separating the model into a sparse and a low-rank part, the approximation error can be theoretically reduced as the two parts can compensate for each other.
+ This paper provides measurements for practical speedups on CPUs with existing sparse computation frameworks.
- The novelty is limited. The combination of low-rankness and sparsity is an old topic that has been explored for many years [R1, R2]. Applying the well-established approximation techniques to decompose/compress the large matrices in LLMs has little technical contribution. Besides, compressing DNN models using low-rank and sparse decomposition has already been well explored in [R3]. This paper just scales it to larger models and matrices. Authors are encouraged to specify the unique difference f
The method proposed by this paper is well-explained and well-justified. The actual practical algorithm is easy to follow. Real-time speedup is shown in the CPU setting. The experiments cover a range of model sizes (3.8-14B parameters). I especially liked seeing results on fairly small models, since those may be harder to compress. Section 5 is an interesting way to look at the problem, which I think can lead to interesting further work in the direction of interpretability.
The choice of the rank ratio parameter could have been better explored (in particular, looking at multiple architectures/tasks).
Typos:
- Line 18: “approximating each weight” -> “approximating each weight matrix”
- Line 142: “the activations are calculated through a calibration set that is propagated through the compressed layers” - should be uncompressed?
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Pruning
