Adaptive Caching for Faster Video Generation with Diffusion Transformers
Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, Tian Xie

TL;DR
This paper presents AdaCache, a training-free adaptive caching method that accelerates diffusion transformer-based video generation by reducing computation based on video complexity, achieving up to 4.7x speedup without quality loss.
Contribution
The paper introduces AdaCache, a novel caching schedule and motion regularization scheme that significantly speeds up video diffusion transformers without degrading quality.
Findings
Achieves up to 4.7x inference speedup on 720p videos.
Maintains high-quality video generation across multiple baselines.
Provides a plug-and-play solution for faster diffusion-based video synthesis.
Abstract
Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to…
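The adaptive caching idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `CODEBOOK`, `mean_squared_distance`, `steps_to_reuse`, and `run_with_adaptive_cache` are hypothetical, the thresholds and reuse counts are invented for the example, and mean squared difference is used only as a stand-in for whatever rate-of-change metric the paper defines. The sketch recomputes a block's output only when the previously cached output can no longer be reused, and decides how long to reuse it from how much the output changed between the last two computations.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

# Hypothetical codebook: (distance upper bound, number of upcoming steps that
# reuse the cached output). Values are illustrative, not taken from the paper.
CODEBOOK: List[Tuple[float, int]] = [(0.05, 4), (0.10, 2), (0.20, 1)]


def mean_squared_distance(a: Vector, b: Vector) -> float:
    """Stand-in change metric between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def steps_to_reuse(distance: float,
                   codebook: List[Tuple[float, int]] = CODEBOOK) -> int:
    """Map a measured change to a reuse count: small change, longer reuse."""
    for bound, reuse in codebook:
        if distance < bound:
            return reuse
    return 0  # large change: recompute at the very next step


def run_with_adaptive_cache(
    block: Callable[[Vector], List[float]],
    inputs: Sequence[Vector],
    codebook: List[Tuple[float, int]] = CODEBOOK,
) -> Tuple[List[List[float]], List[int]]:
    """Apply `block` once per step, reusing the cached output when allowed.

    Returns the per-step outputs and the indices of the steps at which
    `block` was actually executed.
    """
    outputs: List[List[float]] = []
    computed_steps: List[int] = []
    cached = None
    reuse_left = 0
    for step, x in enumerate(inputs):
        if cached is not None and reuse_left > 0:
            outputs.append(cached)  # skip computation, reuse the cache
            reuse_left -= 1
            continue
        fresh = block(x)
        computed_steps.append(step)
        if cached is not None:
            # Schedule the next recomputation from the observed change.
            distance = mean_squared_distance(fresh, cached)
            reuse_left = steps_to_reuse(distance, codebook)
        cached = fresh
        outputs.append(fresh)
    return outputs, computed_steps
```

In this sketch a sequence whose block outputs change slowly triggers few recomputations, while a rapidly changing one recomputes at every step, which is the per-generation quality-latency trade-off the abstract refers to. The sketch does not cover the Motion Regularization scheme.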
Peer Reviews
Decision: Submitted to ICLR 2025
1. Novelty: The idea of adaptive caching seems novel in the field of diffusion model caching. 2. Motivation: The paper provides a clear motivation for the AdaCache method. 3. Clarity: The method is simple and easy to understand.
1. The method section requires clarifications: a. The paper lacks information about the selection of rate-of-change schedule hyperparameters. b. Lines 286-287 state that the authors observe that unique caching schedules for each layer will make the generations unstable. This important observation requires further explanation and clarification. 2. Experiment results require better presentation: a. There are concerns regarding the reported speedup and latency. Given that AdaCache is not a deterministi
1. Adaptive Caching achieves very good performance, even compared with the recent PAB paper. I appreciate this very much. 2. This approach requires no training and can be seamlessly integrated into a baseline video DiT at inference, as a plug-and-play component. 3. Using Motion Regularization (MoReg) to allocate computations based on the motion content in the video being generated seems very reasonable.
1. Regarding the choice of metric, why was the Mean Squared Error (MSE) selected directly? Can the MSE metric truly reflect the actual reduction in features between adjacent steps? Are there alternative metrics that might be more suitable, or can you provide comparisons with other metrics such as the cosine similarity metric or others? 2. Secondly, I'm interested in knowing if the proposed method is compatible with large Text-to-Image (T2I) base models, like FLUX. If it is, what would be the ex
a. The approach presented is straightforward, and the method section is generally clear and easy to follow. b. The motivation for the work is reasonable and interesting, i.e., "not all videos are created equal". c. AdaCache provides a training-free acceleration method that can be applied to existing video diffusion models, achieving significant speedups without additional model training.
1. Lines 285-287 mention that using unique caching schedules for each layer makes the generations unstable, but it’s unclear why this is the case. It would help if the authors provided an explanation. 2. Equation 5 introduces a codebook for the caching rate, but it’s not clear what this codebook is or how it’s created. The authors should add more details to clarify this part of the method. 3. While Table 1 shows AdaCache outperforming PAB, the qualitative comparison in Fig. 7 shows a different r
Taxonomy
Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cellular Automata and Applications
Methods: Softmax · Attention Is All You Need · Diffusion
