SparseDM: Toward Sparse Efficient Diffusion Models
Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu

TL;DR
SparseDM introduces a method to sparsify diffusion models using sparse masks and transfer learning, significantly reducing computation and accelerating inference while maintaining high image quality.
Contribution
The paper presents a novel approach applying sparse masks and transfer learning to improve diffusion model deployment efficiency.
Findings
Reduces MACs by 50% while maintaining FID.
Achieves approximately 1.2x GPU acceleration.
Lower FID than other methods under similar MACs.
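As a rough illustration of where a 50% MACs figure can come from, the sketch below counts multiply-accumulates for a single linear layer, dense and under an N:M sparsity pattern such as the 2:4 pattern mentioned in the reviews. The function `linear_macs` and the layer sizes are hypothetical and are not taken from the paper.

```python
def linear_macs(in_features, out_features, tokens=1, keep_n=4, group_m=4):
    """Count MACs for a linear layer in which keep_n of every group_m
    weights are nonzero (keep_n == group_m means a dense layer)."""
    dense = tokens * in_features * out_features
    return dense * keep_n // group_m

# Hypothetical 1024 x 1024 layer applied to one token.
dense_macs = linear_macs(1024, 1024)
sparse_macs = linear_macs(1024, 1024, keep_n=2, group_m=4)
reduction = 1 - sparse_macs / dense_macs  # fraction of MACs removed
```

MACs are an operation count rather than a wall-clock measurement, so a 50% reduction in MACs and the reported speedup of about 1.2x are different quantities.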
Abstract
Diffusion models represent a powerful family of generative models widely used for image and video generation. However, time-consuming deployment, long inference time, and large memory requirements hinder their application on resource-constrained devices. In this paper, we propose a method based on an improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then transfer-learn the sparse model during the fine-tuning stage and turn on the sparse masks during inference. Experimental results on Transformer-based and UNet-based diffusion models demonstrate that our method reduces MACs by 50% while maintaining FID. Sparse models are accelerated by approximately 1.2x on the GPU. Under other MACs conditions, the FID is also lower than 1 compared to…
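The masking idea in the abstract can be sketched for a single linear layer as follows. This is a minimal NumPy illustration of a 2:4 magnitude mask combined with a plain straight-through update, not the paper's improved estimator or its transfer-learning procedure; the names `two_four_mask` and `ste_step`, the learning rate, and the toy values are all assumptions made for the example.

```python
import numpy as np

def two_four_mask(w):
    """Binary mask that keeps the 2 largest-magnitude weights in every
    group of 4 consecutive weights along the input dimension."""
    rows, cols = w.shape
    assert cols % 4 == 0, "input dimension must be a multiple of 4"
    groups = np.abs(w).reshape(rows, cols // 4, 4)
    # argsort is ascending, so the last two indices are the largest.
    keep = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, keep, 1.0, axis=-1)
    return mask.reshape(rows, cols)

def ste_step(w, x, grad_out, lr=0.1):
    """One SGD step for y = x @ (mask * w).T with a straight-through update."""
    mask = two_four_mask(w)
    y = x @ (mask * w).T              # forward pass uses the sparse weights
    grad_sparse = grad_out.T @ x      # gradient w.r.t. the masked weights
    # Straight-through: apply that gradient to the dense weights unchanged,
    # so pruned weights keep training and may re-enter the mask later.
    w_new = w - lr * grad_sparse
    return y, w_new, mask

# Toy example: one output unit, eight inputs.
w = np.array([[0.1, -0.9, 0.5, 0.05, 2.0, -0.3, 0.2, 1.0]])
x = np.ones((1, 8))
y, w_new, mask = ste_step(w, x, grad_out=np.ones((1, 1)))
```

In this sketch the mask is recomputed from the dense weights on every step, and at inference only `mask * w` would be used.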
Peer Reviews
Decision · ICLR 2025 Conference Withdrawn Submission
The writing is very clear, and the main idea is highlighted effectively.
1. The pruning strategy is based on existing structures, with a relatively simple motivation. There are already other methods that achieve similar results, such as using linear attention or directly training a smaller model with distillation.
2. Compared to directly using STE-based pruning, it does not further reduce the computational load.
3. In Section 3.2, "Transfer learn sparse diffusion models" strategy is mentioned, but it does not explain the significant differences between this strategy
1. This paper is well-written. 2. The motivation is clear enough. 3. The organization of this paper is great.
1. There is a typo in Eq. 5. Please also check all equations. Moreover, not all symbols have been explained.
2. The experiments are relatively limited. Specifically, only two models, U-ViT and DDPM, are tested with the proposed pruning; they were proposed in 2022 and 2020 respectively. More recently proposed DiT or other methods should also be included.
3. The limitation and discussion are missing in this paper.
- The 2:4 sparse model calculation offers practical values for practitioners using NVIDIA Ampere architecture GPUs.
- While it may have some practical value for practitioners using NVIDIA Ampere architecture, the same technique may not benefit other practitioners or general researchers without access to Ampere architecture. - Besides, the straightforward idea of using masked training is neither interesting nor technically new. - More disappointingly, the speed acceleration due to this customized training for a particular architecture increases by x1.2 only. Studies related to reducing time steps for Diffu
* The paper introduces a simple fine-tuning method that converts existing diffusion models into sparse models, enabling them to be used in scenarios with limited computing power, such as on mobile devices.
* The observations about fixed sparse training are interesting.
* Experiments on various generation scenarios verify the effectiveness of SparseDM compared to baselines.
**Weakness 1: More clarifications on Section 2.3.** In Section 2.3, the authors claim that diffusion models only consider the distribution shift of the noisy data while sparse pruning methods only consider the model's weight change. Then, referring to RFR, the authors convert the model's weight changes resulting from sparse pruning methods into data changes for the diffusion model's training process. However, typical diffusion models have indicators for perturbed data (such as the noise schedul
Taxonomy
Topics: Neural Networks and Applications · Brain Tumor Detection and Classification · Stochastic Gradient Optimization Techniques
Methods: Convolution · Diffusion
