Accelerating Diffusion Transformers with Token-wise Feature Caching
Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang

TL;DR
This paper introduces token-wise feature caching for diffusion transformers, adaptively selecting tokens for caching to accelerate image and video synthesis while maintaining high quality.
Contribution
It proposes a novel token-wise caching method that adaptively chooses tokens for caching and applies different caching ratios across layers, improving efficiency without retraining.
Findings
Achieves 2.36× and 1.93× acceleration on OpenSora and PixArt-α, respectively.
No significant drop in generation quality.
Effective for both image and video synthesis.
Abstract
Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10× more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-α, OpenSora, and DiT demonstrate our…
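The idea in the abstract, recomputing only the tokens that are sensitive to caching and reusing stored features for the rest, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_tokens_to_compute`, `cached_layer_step`, and `layer_fn` are hypothetical, the scores are taken as given rather than computed by the paper's scoring functions, and the layer is assumed to act on each token independently (as an MLP does).

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def select_tokens_to_compute(scores: Sequence[float], cache_ratio: float) -> List[int]:
    """Return the indices of tokens to recompute; all others reuse the cache.

    Assumption: a higher score marks a token as more sensitive to caching,
    so the highest-scoring tokens are the ones recomputed.
    """
    n = len(scores)
    n_cached = max(0, min(n, int(cache_ratio * n)))
    n_compute = n - n_cached
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_compute])


def cached_layer_step(
    tokens: Sequence[Vector],
    cache: Sequence[Vector],
    scores: Sequence[float],
    cache_ratio: float,
    layer_fn: Callable[[Vector], Vector],
) -> Tuple[List[Vector], List[int]]:
    """Apply a per-token layer to the selected tokens only.

    `cache` holds this layer's outputs from an earlier timestep. The
    returned outputs can serve as the cache for the next timestep.
    """
    fresh = select_tokens_to_compute(scores, cache_ratio)
    outputs = [row[:] for row in cache]    # start from the cached features
    for i in fresh:
        outputs[i] = layer_fn(tokens[i])   # recompute sensitive tokens
    return outputs, fresh
```

Because `cache_ratio` is an argument of each call, a different ratio can be passed for each layer type or depth, which is the layer-wise flexibility the abstract describes. With `cache_ratio=0.5` and four tokens, for example, only the two highest-scoring tokens pass through `layer_fn`.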
Peer Reviews
Decision: ICLR 2025 Poster
- The visualizations of temporal redundancy and error propagation illustrate the motivation of ToCa in an intuitive way, making the design choices relatively well motivated.
- The four token selection scoring functions and layer-specific cache ratios are natural design choices that enable the caching strategy to be more fine-grained.
- ToCa achieves more than 2x acceleration ratios on both text-to-image and text-to-video tasks, while having better quality than baselines.
- ToCa’s training-free na

- While ToCa achieves reasonable benchmark results, some artifacts remain in the generated images compared to the originals. For instance, in Figure 6, the moon is missing from the "wolf howling at the full moon" prompt, and the background forests appear blurred in the "tranquil beach" prompt. It is necessary to demonstrate how ToCa performs with high-resolution images (1024x1024) generated by more advanced models like FLUX.1-dev [1].
- Another important direction in accelerating diffusion model

1. The motivation behind the proposed approach is both technically sound and clearly explained, supported by two informative figures illustrating the varying levels of similarity among different tokens and the accumulation of error across these tokens. These visual aids effectively convey the rationale for the proposed method.
2. The methodology for selecting scores and making decisions for each layer is novel, offering valuable insights into the acceleration of transformer models. This innovat

1. There appears to be a lack of a complete algorithmic description within the manuscript. Specifically, a detailed explanation is needed for the token selection process at each layer and each timestep, as well as the procedure for redistributing the cached tokens back into the overall framework. This omission could lead to misunderstandings regarding the efficacy and functionality of the proposed method.
2. The manuscript does not adequately illustrate the computation costs associated with the

- The paper is well-written and easy to understand.
- The observation that the similarity and propagated error differ for each token in DiT is very insightful.
- Various metrics were proposed to measure the importance of tokens for reuse.
- Experiments were conducted on a variety of datasets, including not only text-to-image but also text-to-video.

- It's difficult to understand how token-wise caching leads to actual acceleration. Since the attention layer calculates the similarity among all inputs through matrix multiplication, even if one output token is not computed, there won't be a significant difference in the overall computational cost (similar to unstructured pruning). A more detailed explanation or breakdown of where this acceleration comes from is required.
- Evaluation is performed with just a single latency/FID point. An evaluat
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Stochastic Gradient Optimization Techniques
Methods: Diffusion
