SparseDiT: Token Sparsification for Efficient Diffusion Transformer

Shuning Chang; Pichao Wang; Jiasheng Tang; Fan Wang; Yi Yang

arXiv:2412.06028 · cs.CV · September 24, 2025

TL;DR

SparseDiT introduces a token sparsification framework for diffusion transformers, significantly reducing computational costs while maintaining high-quality image and video generation performance.

Contribution

The paper proposes a spatial-temporal token sparsification method that improves the efficiency of diffusion transformers without sacrificing generative quality.

Findings

01

Achieves a 55% reduction in FLOPs on DiT-XL with a similar FID score.

02

Improves inference speed by 175% on DiT-XL.

03

Reduces FLOPs by 56% across video datasets.
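
To put the percentages above on a common scale, the short sketch below converts them into multiplicative factors. It assumes that "improves inference speed by 175%" means throughput is 2.75x the baseline and that a FLOPs reduction translates directly into an ideal compute-bound speedup; neither definition is stated on this page, and the helper names are hypothetical.

```python
def flops_speedup(flops_reduction: float) -> float:
    """Ideal speedup if runtime scaled exactly with FLOPs (an assumption)."""
    return 1.0 / (1.0 - flops_reduction)


def throughput_ratio(speed_improvement: float) -> float:
    """Throughput relative to baseline, reading 'X% faster' as 1 + X (an assumption)."""
    return 1.0 + speed_improvement


ideal = flops_speedup(0.55)        # 55% fewer FLOPs on DiT-XL
measured = throughput_ratio(1.75)  # 175% faster inference on DiT-XL

print(f"ideal FLOPs-bound speedup: {ideal:.2f}x")
print(f"reported throughput ratio: {measured:.2f}x")
```

Under these readings the reported speedup exceeds the FLOPs-only estimate, which would be consistent with savings concentrated in self-attention, where memory traffic also falls as tokens are removed. That interpretation is a guess, not a claim made by the source.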

Abstract

Diffusion Transformers (DiT) are renowned for their impressive generative performance; however, they are significantly constrained by considerable computational costs due to the quadratic complexity in self-attention and the extensive sampling steps required. While advancements have been made in expediting the sampling process, the underlying architectural inefficiencies within DiT remain underexplored. We introduce SparseDiT, a novel framework that implements token sparsification across spatial and temporal dimensions to enhance computational efficiency while preserving generative quality. Spatially, SparseDiT employs a tri-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local…
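
The quadratic cost mentioned in the abstract can be illustrated with a back-of-the-envelope count. The sketch below is not SparseDiT's Poolingformer or SDTM; it only shows how reducing the token count by pooling shrinks the token-to-token part of self-attention. The 32x32 token grid, the hidden size of 1152, and the pooling stride of 2 are illustrative assumptions, and the function names are hypothetical.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough multiply-accumulate count for one self-attention layer.

    Counts only QK^T and the attention-weighted sum over V, each costing
    num_tokens * num_tokens * dim. Linear projections are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def pooled_token_count(height: int, width: int, stride: int) -> int:
    """Tokens left after non-overlapping stride x stride pooling of a token grid."""
    return (height // stride) * (width // stride)


dim = 1152                                            # assumed hidden size
dense_tokens = 32 * 32                                # assumed 32x32 token grid
sparse_tokens = pooled_token_count(32, 32, stride=2)  # assumed 2x2 pooling

ratio = attention_flops(sparse_tokens, dim) / attention_flops(dense_tokens, dim)
print(f"{dense_tokens} -> {sparse_tokens} tokens, attention cost ratio {ratio}")
```

Because the cost grows with the square of the token count, keeping a quarter of the tokens cuts this term to one sixteenth, which is why sparsifying tokens in attention-heavy layers can pay off disproportionately.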

Code & Models

Repositories

changsn/FlexDiT
PyTorch · Official

Taxonomy

Topics: Metal and Thin Film Mechanics · Welding Techniques and Residual Stresses · Advancements in Photolithography Techniques

Methods: Pruning · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings