Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization
Vladimír Boža, Vladimír Macko

TL;DR
This paper introduces Double Sparse Factorization (DSF), a novel method for sparsifying neural networks by factorizing weight matrices into two sparse matrices, achieving significant size reduction while maintaining or improving performance.
Contribution
The paper proposes a new heuristic based on alternating minimization via ADMM for efficient double sparse factorization, enabling unprecedented neural network sparsification.
Findings
In a one-shot pruning setting, reduced the size of LLaMA2-13B by 50% while performing better than the dense LLaMA2-7B model.
Outperformed state-of-the-art layer-wise pruning methods like Optimal Brain Compression.
Accuracy improvements persisted after further fine-tuning.
Abstract
Neural networks are often challenging to work with due to their large size and complexity. To address this, various methods aim to reduce model size by sparsifying or decomposing weight matrices, such as magnitude pruning and low-rank or block-diagonal factorization. In this work, we present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Although solving this problem exactly is computationally infeasible, we propose an efficient heuristic based on alternating minimization via ADMM that achieves state-of-the-art results, enabling unprecedented sparsification of neural networks. For instance, in a one-shot pruning setting, our method can reduce the size of the LLaMA2-13B model by 50% while maintaining better performance than the dense LLaMA2-7B model. We also compare favorably with Optimal Brain Compression, the state-of-the-art…
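To make the alternating scheme concrete, here is a minimal NumPy sketch of factorizing a small matrix W into two sparse factors A and B. It is an illustration under stated assumptions, not the authors' implementation: the identity initialization, the global top-k magnitude projection, the fixed penalty `rho`, the iteration counts, and all function names (`keep_top_k`, `admm_sparse_lstsq`, `double_sparse_factorize`) are hypothetical choices made for this example.

```python
import numpy as np


def keep_top_k(M, k):
    """Return a copy of M with all but its k largest-magnitude entries zeroed."""
    k = min(k, M.size)
    out = np.zeros(M.shape, dtype=M.dtype)
    if k > 0:
        idx = np.argpartition(np.abs(M).ravel(), -k)[-k:]
        out.flat[idx] = M.flat[idx]
    return out


def admm_sparse_lstsq(A, W, k, rho=1.0, iters=30):
    """Heuristically minimize ||W - A @ B||_F^2 over B with at most k nonzeros.

    Standard ADMM splitting B = Z: B takes a ridge-like least-squares step,
    Z is the sparse projection, U is the scaled dual variable.
    """
    n = A.shape[1]
    G = A.T @ A + rho * np.eye(n)  # positive definite, so solve() is safe
    AtW = A.T @ W
    Z = keep_top_k(np.linalg.lstsq(A, W, rcond=None)[0], k)
    U = np.zeros_like(Z)
    for _ in range(iters):
        B = np.linalg.solve(G, AtW + rho * (Z - U))
        Z = keep_top_k(B + U, k)
        U += B - Z
    return Z  # Z satisfies the sparsity constraint exactly


def double_sparse_factorize(W, k_a, k_b, outer=10):
    """Alternate between the two factors so that W is approximated by A @ B."""
    A = np.eye(W.shape[0])  # assumed initialization, replaced after the first pass
    for _ in range(outer):
        B = admm_sparse_lstsq(A, W, k_b)        # fix A, update B
        A = admm_sparse_lstsq(B.T, W.T, k_a).T  # fix B, update A (transposed problem)
    return A, B


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 12))
A, B = double_sparse_factorize(W, k_a=16, k_b=48)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"nnz(A)={np.count_nonzero(A)}, nnz(B)={np.count_nonzero(B)}, rel_err={rel_err:.3f}")
```

Because the sparsity constraint is nonconvex, ADMM here is a heuristic with no convergence guarantee; the sketch only guarantees that the returned factors respect their nonzero budgets.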
Peer Reviews
Decision: ICLR 2025 Poster
1. The idea is nice, the problem formulation is neat, and using ADMM for optimization is elegant. 2. On pruning LLaMA, the method shows a clear benefit over the compared methods. Image classification results are marginally better than those of previous methods.
1. The SVD comparison is unfair in my opinion. SVD is better suited to low-rank compression and may not enforce sparsity. Using the sparsity ratio as the main criterion may not be ideal. Why not use FLOPs, which relate directly to inference speed, unlike the sparsity ratio? I would suggest that the authors include a comparison based on FLOPs in addition to the sparsity ratio. This would provide a more comprehensive evaluation of computational efficiency across different compression metho
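As a rough illustration of how sparsity ratio and FLOPs relate, the sketch below counts idealized matrix-vector FLOPs (one multiply and one add per stored nonzero) for several compression schemes at the same parameter budget. The 4096 x 4096 layer size, the even split of the budget between the two sparse factors, and the function name `matvec_flops` are assumptions made for this example; the count ignores index storage and hardware effects, so real sparse kernels may behave differently.

```python
def matvec_flops(nnz):
    """Idealized FLOPs for y = M @ x: one multiply and one add per stored nonzero."""
    return 2 * nnz


m, n = 4096, 4096      # hypothetical layer shape
budget = (m * n) // 2  # keep 50% of the dense parameter count

dense = matvec_flops(m * n)
single_sparse = matvec_flops(budget)

# Two sparse factors applied one after the other, sharing the same budget.
double_sparse = matvec_flops(budget // 2) + matvec_flops(budget - budget // 2)

# Low-rank factors (m x r and r x n) holding the same number of parameters.
rank = budget // (m + n)
low_rank = matvec_flops(rank * (m + n))

for name, flops in [("dense", dense), ("single sparse", single_sparse),
                    ("double sparse", double_sparse), ("low rank", low_rank)]:
    print(f"{name:>14}: {flops:,} FLOPs")
```

Under this idealized count, every scheme with the same number of stored parameters has the same FLOPs, so any difference in practice would come from memory access patterns and kernel efficiency rather than from the arithmetic itself.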
* While pruning of factors obtained from matrix decomposition is not a novel contribution per se (Le Magoarou & Gribonval, 2016), its application to pretrained model compression is novel as far as I know. In any case, this work clearly distinguishes itself from prior art by focusing on the model compression task, particularly in the context of LLMs. * The paper is well written. * The empirical results outperform strong, SOTA baselines in a variety of contexts for LLMs and CNNs. * The authors
Overall, I am leaning towards accept. However, I have some significant concerns regarding the practical applicability of the proposed method. Fundamentally, we require compressed models that offer advantages in one or more of the following dimensions: memory overhead, latency, and/or throughput. For each of these dimensions, we can consider both training and inference. For the following discussion, let’s consider an intermediate fully-connected layer from a decoder block in a LLaMa 2-7B @ 50% s
The idea is interesting, to the best of my knowledge relatively novel, and the experiments are quite convincing. Most of the paper is fairly easy to follow and the reader is not left with many questions. I appreciate that the authors provide results before and after retraining the pruned models, as this is often not done in other papers. The proposed method is interesting, however there are open questions that I will discuss below.
I have several concerns regarding the soundness, clarity, and contribution of this work, which I detail below. I hope these remarks are helpful for improving the paper and am open to discussing my evaluation. ### Clarity While I think that the idea proposed in this paper might be promising, I sometimes had a hard time following the paper. I think the structure as well as details could be improved. - Section 3.1 would greatly benefit from a more detailed explanation of the ADMM method. How are Z
Taxonomy
Topics: Neural Networks and Applications · Matrix Theory and Algorithms · Face and Expression Recognition
Methods: Pruning · Alternating Direction Method of Multipliers
