SparseDiT: Token Sparsification for Efficient Diffusion Transformer
Shuning Chang, Pichao Wang, Jiasheng Tang, Fan Wang, Yi Yang

TL;DR
SparseDiT introduces a token sparsification framework for diffusion transformers, significantly reducing computational costs while maintaining high-quality image and video generation performance.
Contribution
It proposes a novel spatial-temporal token sparsification method that enhances efficiency in diffusion transformers without sacrificing generative quality.
Findings
Achieves a 55% reduction in FLOPs on DiT-XL with a similar FID score.
Improves inference speed by 175% on DiT-XL.
Reduces FLOPs by 56% across video datasets.
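To put the percentages above on a common scale, the short sketch below converts them to multiplicative factors. It assumes that an "X% reduction" means the new cost is (1 - X/100) of the baseline and that an "X% improvement in speed" means throughput is (1 + X/100) times the baseline; the summary does not spell out these conventions, and the helper names are illustrative only.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the baseline cost left after a percentage reduction (assumed convention)."""
    return 1.0 - reduction_pct / 100.0


def speedup_factor(improvement_pct: float) -> float:
    """Throughput multiplier implied by a percentage speed improvement (assumed convention)."""
    return 1.0 + improvement_pct / 100.0


# Figures taken from the Findings list above.
dit_xl_flops = remaining_fraction(55)   # share of baseline FLOPs on DiT-XL
dit_xl_speed = speedup_factor(175)      # throughput multiplier on DiT-XL
video_flops = remaining_fraction(56)    # share of baseline FLOPs on video datasets

print(f"DiT-XL FLOPs: {dit_xl_flops:.2f}x of baseline")
print(f"DiT-XL speed: {dit_xl_speed:.2f}x of baseline")
print(f"Video FLOPs:  {video_flops:.2f}x of baseline")
```

Under these assumptions, the DiT-XL result corresponds to a bit under half of the baseline FLOPs and a bit under three times the baseline throughput.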
Abstract
Diffusion Transformers (DiT) are renowned for their impressive generative performance; however, they are significantly constrained by considerable computational costs due to the quadratic complexity in self-attention and the extensive sampling steps required. While advancements have been made in expediting the sampling process, the underlying architectural inefficiencies within DiT remain underexplored. We introduce SparseDiT, a novel framework that implements token sparsification across spatial and temporal dimensions to enhance computational efficiency while preserving generative quality. Spatially, SparseDiT employs a tri-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local…
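The quadratic cost of self-attention mentioned in the abstract is what makes reducing the token count pay off. The sketch below is a generic illustration of that scaling, not SparseDiT's Poolingformer or SDTM: it counts only the multiply-accumulates of the two attention matrix products and models sparsification as plain non-overlapping pooling on a square token grid. The grid size, hidden width, pooling factor, and function names are all assumptions chosen for illustration.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus the attention-weighted sum of V.

    Linear projections, softmax, and the MLP are deliberately ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def pooled_tokens(grid_side: int, pool: int) -> int:
    """Token count after non-overlapping pool x pool pooling of a square token grid."""
    if grid_side % pool != 0:
        raise ValueError("grid_side must be divisible by pool")
    side = grid_side // pool
    return side * side


# Illustrative values (assumptions, not taken from the summary above).
grid_side, dim, pool = 32, 1152, 2

dense_tokens = grid_side * grid_side            # tokens before pooling
sparse_tokens = pooled_tokens(grid_side, pool)  # tokens after pooling

dense_cost = attention_macs(dense_tokens, dim)
sparse_cost = attention_macs(sparse_tokens, dim)

print(f"tokens: {dense_tokens} -> {sparse_tokens}")
print(f"attention MACs shrink by a factor of {dense_cost // sparse_cost}")
```

Because the cost grows with the square of the token count, cutting the tokens by a factor of 4 cuts the attention products by a factor of 16 in this simplified count. Whole-model savings are smaller, since the per-token layers scale only linearly with token count.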
Taxonomy
Topics: Metal and Thin Film Mechanics · Welding Techniques and Residual Stresses · Advancements in Photolithography Techniques
Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
