Token Caching for Diffusion Transformer Acceleration
Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Yuming Li, Chenguang Ma

TL;DR
TokenCache accelerates diffusion transformers by caching tokens and reusing them where computation is redundant, cutting inference cost while maintaining high generation quality and making these models more practical to deploy.
Contribution
Introduces TokenCache, a novel method that hierarchically optimizes token pruning, block selection, and temporal scheduling for efficient diffusion transformer acceleration.
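The three levels named above can be pictured as nested decisions made at every denoising step. The Python sketch below is a hypothetical illustration, not the authors' implementation. `CachePlan`, `make_plan`, the per-block scores, and the parameters `refresh_every`, `block_threshold`, and `keep_ratio` are assumptions introduced here, and fixed heuristics stand in for the learned Cache Predictor.

```python
from dataclasses import dataclass


@dataclass
class CachePlan:
    """Caching decision for one denoising step (hypothetical structure)."""
    step: int
    full_compute: bool       # temporal level: recompute every block and token
    cached_blocks: list      # block level: blocks allowed to reuse cached tokens
    keep_ratio: float        # token level: fraction of tokens recomputed in cached blocks


def make_plan(num_steps, num_blocks, block_scores,
              refresh_every=3, block_threshold=0.5, keep_ratio=0.25):
    """Toy three-level schedule built from fixed heuristics.

    Hypothetical assumptions: every `refresh_every`-th step is a full step,
    and a block is cacheable if its redundancy score exceeds
    `block_threshold`. TokenCache learns these decisions with its Cache
    Predictor instead.
    """
    cacheable = [b for b in range(num_blocks) if block_scores[b] > block_threshold]
    plans = []
    for t in range(num_steps):
        full = t % refresh_every == 0  # step 0 is always full: nothing is cached yet
        plans.append(CachePlan(
            step=t,
            full_compute=full,
            cached_blocks=[] if full else list(cacheable),
            keep_ratio=1.0 if full else keep_ratio,
        ))
    return plans


if __name__ == "__main__":
    for p in make_plan(num_steps=6, num_blocks=4, block_scores=[0.9, 0.2, 0.7, 0.4]):
        print(p)
```

Forcing step 0 to be a full step ensures the cache is populated before any later step reads from it; lowering `keep_ratio` or raising `refresh_every` trades quality for speed.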
Findings
Achieves faster inference with minimal quality loss.
Effectively balances speed and generation quality across models.
Demonstrates substantial speedup in diffusion generation tasks.
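To see how the three caching levels can compound into a speedup, the back-of-the-envelope cost model below counts block evaluations. It is a hypothetical idealization, not a reproduction of the paper's measurements: it assumes every block costs one unit per step, that a cached block's cost scales linearly with the fraction of tokens it recomputes, and it ignores the predictor's own overhead. The function `ideal_speedup` and its parameters are illustrative names.

```python
def ideal_speedup(num_steps, num_blocks, refresh_every, num_cached_blocks, keep_ratio):
    """Ratio of uncached to cached compute under an idealized cost model.

    Hypothetical assumptions: every block costs 1 unit per step; a cached
    block costs `keep_ratio` units; full-refresh steps occur every
    `refresh_every` steps; predictor overhead is ignored.
    """
    if not 0 <= num_cached_blocks <= num_blocks:
        raise ValueError("num_cached_blocks must lie in [0, num_blocks]")
    baseline = num_steps * num_blocks
    cost = 0.0
    for t in range(num_steps):
        if t % refresh_every == 0:
            cost += num_blocks  # full step: every block recomputed
        else:
            cost += (num_blocks - num_cached_blocks) + num_cached_blocks * keep_ratio
    return baseline / cost


if __name__ == "__main__":
    # 4 steps, 4 blocks, refresh every 2nd step, 2 cached blocks at 50% tokens:
    # cost = 4 + 3 + 4 + 3 = 14 units versus 16 uncached.
    print(round(ideal_speedup(4, 4, 2, 2, 0.5), 3))  # → 1.143
```

Under these assumptions the speedup is bounded by the refresh schedule: even if non-refresh steps skipped all work, the ratio could not exceed `refresh_every`.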
Abstract
Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and multi-step inference processes, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? In response to these challenges, TokenCache introduces a Cache Predictor that hierarchically…
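A minimal sketch of token-level reuse inside a single block, assuming a simple selection rule: tokens whose inputs drifted most since the cached step are recomputed, and all other tokens copy their cached outputs. The function names, the drift score, and `keep_ratio` are hypothetical; TokenCache's Cache Predictor makes this choice in its own way, and the toy block omits projections, normalization, the MLP, and timestep conditioning.

```python
import numpy as np


def toy_attention_block(q_tokens, all_tokens):
    """Single-head self-attention without projections, plus a residual.

    Recomputed tokens act as queries but still attend over all tokens.
    """
    d = all_tokens.shape[-1]
    logits = q_tokens @ all_tokens.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return q_tokens + weights @ all_tokens


def block_with_token_cache(x, cached_in, cached_out, block_fn, keep_ratio):
    """Recompute the most-changed tokens; reuse cached outputs for the rest.

    x, cached_in, cached_out: arrays of shape (num_tokens, dim).
    The input-drift score is a hypothetical heuristic standing in for
    TokenCache's learned Cache Predictor.
    """
    num_tokens = x.shape[0]
    k = max(1, int(round(keep_ratio * num_tokens)))
    scores = np.linalg.norm(x - cached_in, axis=-1)       # per-token drift
    recompute_idx = np.argpartition(-scores, k - 1)[:k]   # top-k drift, unordered
    out = cached_out.copy()
    out[recompute_idx] = block_fn(x[recompute_idx], x)
    return out, recompute_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev = rng.standard_normal((8, 4))
    cached_out = toy_attention_block(prev, prev)
    x = prev.copy()
    x[[1, 5]] += 1.0  # only tokens 1 and 5 change between steps
    out, idx = block_with_token_cache(x, prev, cached_out, toy_attention_block, 0.25)
    print(sorted(idx.tolist()))  # → [1, 5]
```

Reused tokens keep slightly stale outputs even though the tokens they attend to have changed; keeping that approximation error small while skipping most of the computation is the balance a token caching method has to strike.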
