TL;DR
HierarchicalPrune is a novel compression framework that leverages the functional hierarchy of diffusion model blocks to significantly reduce model size and inference latency while maintaining high output quality.
Contribution
This work introduces HierarchicalPrune, combining position-aware pruning, weight preservation, and sensitivity-guided distillation for effective diffusion model compression.
Findings
Achieves up to 80% memory reduction with minimal quality loss.
Reduces inference latency by up to 38%.
Maintains perceptual quality comparable to original models.
Abstract
State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inference on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise…
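The first two techniques in the abstract can be illustrated with a small planning sketch. This is not the paper's implementation: the abstract does not say how block importance is measured or which ratios are used, so the `importance` score, `prune_ratio`, `protect_ratio`, and the `plan_compression` helper are all hypothetical names chosen for illustration. The sketch only shows the general idea of restricting pruning candidates to later blocks while marking early blocks as protected.

```python
from dataclasses import dataclass


@dataclass
class Block:
    index: int         # position in the network (0 = earliest block)
    importance: float  # hypothetical per-block contribution score (assumption)


def plan_compression(blocks, prune_ratio=0.25, protect_ratio=0.5):
    """Return (pruned, frozen) block indices for a position-aware plan.

    prune_ratio and protect_ratio are illustrative values, not from the paper.
    """
    n = len(blocks)
    n_protect = int(n * protect_ratio)  # early blocks that are never pruned
    n_prune = int(n * prune_ratio)      # number of blocks to remove

    # Only later blocks are candidates for removal.
    candidates = [b for b in blocks if b.index >= n_protect]

    # Remove the least important candidates first; ties go to later positions.
    candidates.sort(key=lambda b: (b.importance, -b.index))
    pruned = sorted(b.index for b in candidates[:n_prune])

    # Early blocks are kept and marked as frozen (weights preserved).
    frozen = list(range(n_protect))
    return pruned, frozen


# Toy example with eight blocks and made-up importance scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.5, 0.1, 0.3]
blocks = [Block(i, s) for i, s in enumerate(scores)]
pruned, frozen = plan_compression(blocks)
print("pruned:", pruned)  # pruned: [4, 6]
print("frozen:", frozen)  # frozen: [0, 1, 2, 3]
```

In a real model, the frozen indices would correspond to blocks whose parameters are excluded from updates during fine-tuning, and the pruned indices to blocks dropped from the forward pass. The distillation step is left out here because the abstract text is truncated before it is fully described.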