Token Caching for Diffusion Transformer Acceleration

Jinming Lou; Wenyang Luo; Yufan Liu; Bing Li; Xinmiao Ding; Weiming Hu; Yuming Li; Chenguang Ma

arXiv:2409.18523 · cs.LG · January 28, 2026

TL;DR

TokenCache accelerates diffusion transformers by caching and reusing token outputs during multi-step inference, cutting redundant computation while preserving generation quality.

Contribution

Introduces TokenCache, a novel method that hierarchically optimizes token pruning, block selection, and temporal scheduling for efficient diffusion transformer acceleration.

Findings

1. Achieves faster inference with minimal quality loss.
2. Effectively balances speed and accuracy across models.
3. Demonstrates substantial speedup in diffusion generation tasks.

Abstract

Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and multi-step inference processes, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? In response to these challenges, TokenCache introduces a Cache Predictor that hierarchically…
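
The three questions in the abstract (which tokens, which blocks, which time steps) can be illustrated with a toy sketch. This is not the paper's Cache Predictor, whose details are cut off above. It only shows the general idea of reusing cached token outputs. Every name here (`forward_block`, `run_step`, `scores`, `keep_ratio`, `cached_blocks`, `use_cache`) is hypothetical, the importance scores are hand-picked rather than learned, and each "block" is a per-token function, whereas a real transformer block also mixes tokens through attention.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Token = List[float]
Block = Callable[[Token], Token]


def forward_block(
    tokens: Sequence[Token],
    block: Block,
    cached: Optional[Sequence[Token]],
    scores: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[Token], int]:
    """Run one block, recomputing only the highest-scoring tokens.

    Returns the block output and the number of tokens actually recomputed.
    """
    n = len(tokens)
    if cached is None:
        # Nothing to reuse: compute every token.
        return [block(t) for t in tokens], n
    k = max(1, round(keep_ratio * n))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    recompute = set(order[:k])
    # Low-scoring tokens are "pruned": their cached output is reused as-is.
    out = [block(tokens[i]) if i in recompute else list(cached[i]) for i in range(n)]
    return out, k


def run_step(
    tokens: Sequence[Token],
    blocks: Sequence[Block],
    cache: List[Optional[List[Token]]],
    scores: Sequence[float],
    keep_ratio: float,
    cached_blocks: Set[int],
    use_cache: bool,
) -> Tuple[List[Token], int]:
    """Run all blocks for one denoising step, updating `cache` in place."""
    current = list(tokens)
    recomputed = 0
    for b, block in enumerate(blocks):
        # Caching applies only at chosen steps and only in chosen blocks.
        reuse = cache[b] if (use_cache and b in cached_blocks) else None
        current, k = forward_block(current, block, reuse, scores, keep_ratio)
        cache[b] = current
        recomputed += k
    return current, recomputed


def add_one(t: Token) -> Token:
    return [x + 1.0 for x in t]


def double(t: Token) -> Token:
    return [x * 2.0 for x in t]


blocks = [add_one, double]
cache: List[Optional[List[Token]]] = [None, None]
scores = [0.9, 0.1, 0.8, 0.2]  # hand-picked stand-in for predicted importance

# Step 0: full computation, which also fills the cache.
_, recomputed_full = run_step(
    [[1.0], [2.0], [3.0], [4.0]], blocks, cache, scores, 0.5, {1}, use_cache=False
)

# Step 1: block 1 recomputes only the top half of the tokens.
out, recomputed_cached = run_step(
    [[10.0], [20.0], [30.0], [40.0]], blocks, cache, scores, 0.5, {1}, use_cache=True
)
```

In this toy run the second step performs 6 token updates instead of 8, and the two low-scoring tokens carry stale values over from the previous step. That staleness is the speed/quality trade-off the abstract refers to.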

Taxonomy

Methods: Softmax · Attention Is All You Need · Pruning · Diffusion