Learnable Sparsity for Vision Generative Models

Yang Zhang; Er Jin; Wenzhong Liang; Yanfei Dong; Ashkan Khakzar; Philip Torr; Johannes Stegmaier; Kenji Kawaguchi

arXiv:2412.02852·cs.CV·March 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a low-cost, retraining-free structural pruning method for diffusion models in vision tasks, which learns differentiable masks and employs memory-efficient techniques to prune up to 20% of parameters with minimal performance loss.

Contribution

It presents a novel end-to-end, model-agnostic pruning framework for diffusion models that does not require retraining and incorporates memory-saving gradient checkpointing.

Findings

01. Prunes up to 20% of parameters with minimal quality loss.
02. Achieves effective pruning on the state-of-the-art diffusion models SDXL and FLUX.
03. Maintains performance even when applied to time-step-distilled diffusion models.

Abstract

Diffusion models have achieved impressive advancements in various vision tasks. However, these gains often rely on increasing model size, which escalates computational complexity and memory demands, complicating deployment, raising inference costs, and causing environmental impact. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to retain the model performance. Retraining a modern large diffusion model is extremely costly and resource-intensive, which limits the practicality of these methods. In this work, we achieve low-cost diffusion pruning without retraining by proposing a model-agnostic structural pruning framework for diffusion models that learns a differentiable mask to sparsify the model. To ensure effective pruning that preserves the quality of the final denoised latent,…
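The idea of learning a differentiable mask under a sparsity penalty can be illustrated with a toy sketch. This is not the paper's implementation: the frozen per-channel weights, the calibration inputs, the L1 penalty weight `lam`, the learning rate, and the function name `learn_mask` are all assumptions made for illustration, and the gradient is written out by hand instead of using an autodiff library.

```python
# Toy sketch: learn a continuous mask over frozen channels with an L1 penalty.
# All names and numbers are illustrative assumptions, not taken from the paper.

def learn_mask(weights, inputs, lam=0.05, lr=0.05, steps=500):
    """Return one mask value in [0, 1] per channel of a frozen toy model.

    Channel i outputs mask[i] * weights[i] * x. The objective is
    mean_x sum_i (masked_i - dense_i)^2 + lam * sum_i mask[i].
    """
    mean_x_sq = sum(x * x for x in inputs) / len(inputs)
    mask = [1.0] * len(weights)  # start from the dense (unpruned) model
    for _ in range(steps):
        for i, w in enumerate(weights):
            # Gradient of the reconstruction term plus the L1 term.
            # The mask is kept non-negative, so d|m|/dm is taken as 1.
            grad = 2.0 * (mask[i] - 1.0) * w * w * mean_x_sq + lam
            # Projected gradient step keeps the mask inside [0, 1].
            mask[i] = min(1.0, max(0.0, mask[i] - lr * grad))
    return mask


weights = [2.0, 0.01, -1.5, 0.02]   # frozen toy weights (never updated)
inputs = [1.0, -1.0, 0.5, 2.0]      # toy calibration inputs
mask = learn_mask(weights, inputs)
keep = [m > 0.5 for m in mask]      # binarize for structural pruning
print([round(m, 3) for m in mask], keep)
```

Only the mask is updated and the weights stay frozen, which is the sense in which this kind of pruning avoids retraining. In this toy, channels with near-zero weights contribute little to the output, so the penalty drives their mask entries to zero while the large-weight channels stay close to one.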

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

General framework across architectures and paradigms: EcoDiff is demonstrated on both U-Net diffusion models (SD2, SDXL) and DiT-based flow models (FLUX-dev, FLUX-schnell), with a unified mask-learning approach. This cross-architecture applicability is a strong plus.

Memory-efficient optimization via time-step checkpointing: The proposed time-step gradient checkpointing significantly reduces memory usage for backprop through the full trajectory, making mask learning feasible for large models on

Weaknesses

Missing baselines: The paper omits comparison with recent pruning methods such as LD-Pruner (Castells et al., 2024) and Efficient Pruning of Text-to-Image Models (Ramesh & Zhao, 2024). LD-Pruner proposes its own task-agnostic structured pruning strategy for latent diffusion models, while Efficient Pruning reports strong results using simple baselines like magnitude and WANDA pruning. Including both would better contextualize EcoDiff’s performance relative to recent structured and baseline pruni

Reviewer 02 · Rating 4 · Confidence 4

Strengths

(1) The paper is technically sound, with various empirical results across multiple models (SDXL, FLUX) and metrics (FID, CLIP, SSIM). The writing is clear and well-structured. (2) The work addresses a critical problem—efficient deployment of large generative models—and offers a practical solution with low computational overhead. The method seems model-agnostic and compatible with existing acceleration techniques.

Weaknesses

(1) The approximation of the $L_0$ regularization to $L_1$ (Appendix A) is heuristic and lacks theoretical guarantees. More rigorous analysis would strengthen the method. (2) The method is only validated on image generation tasks. Its applicability to video generation remains unverified. Moreover, the assumption that all denoising steps share the same mask may not hold for models with highly temporal dynamics. (3) The use of synthetic data from GCC3M for retraining, rather than original train

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Pruning generative models facilitates their deployment and application. 2. The paper is presented intuitively and clearly. 3. The method is simple and effective.

Weaknesses

1. The mask isn't actually composed of only 0s and 1s; its values can lie anywhere in [0,1]. How should this be handled? This would likely result in a performance penalty. 2. While this paper is very effective as engineering, its overall contribution is incremental. Differentiable masks and gradient checkpointing are readily available techniques.
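The first weakness concerns how a continuous mask is turned into an actual pruning decision. This page does not say how the paper resolves it. One common option, assumed here purely for illustration, is to zero the lowest-scoring fraction of mask entries and set the rest to one. The function name `binarize_by_ratio` and the example values are hypothetical.

```python
def binarize_by_ratio(mask, prune_ratio):
    """Zero the lowest-scoring fraction of mask entries and set the rest to 1."""
    n_prune = round(len(mask) * prune_ratio)
    # Indices sorted by ascending mask value; ties keep their original order.
    order = sorted(range(len(mask)), key=lambda i: mask[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(mask))]


# Hypothetical learned mask with values in [0, 1].
soft_mask = [0.97, 0.03, 0.88, 0.41, 0.99, 0.76, 0.12, 0.93, 0.65, 0.81]
hard_mask = binarize_by_ratio(soft_mask, prune_ratio=0.2)
print(hard_mask)
```

Any gap between the soft and hard masks (for example, the entry 0.41 being rounded up to 1) changes the model's output, which is the source of the performance penalty the reviewer anticipates.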

Taxonomy

Topics: Simulation Techniques and Applications

Methods: Convolution · Concatenated Skip Connection · Diffusion · Max Pooling · Pruning · U-Net