Adaptive Caching for Faster Video Generation with Diffusion Transformers

Kumara Kahatapitiya; Haozhe Liu; Sen He; Ding Liu; Menglin Jia; Chenyang Zhang; Michael S. Ryoo; Tian Xie

arXiv:2411.02397·cs.CV·November 8, 2024

TL;DR

This paper presents AdaCache, a training-free adaptive caching method that accelerates diffusion transformer-based video generation by reducing computation based on video complexity, achieving up to 4.7x speedup without quality loss.

Contribution

The paper introduces AdaCache, a novel caching schedule and motion regularization scheme that significantly speeds up video diffusion transformers without degrading quality.

Findings

1. Achieves up to 4.7x inference speedup on 720p videos.
2. Maintains high-quality video generation across multiple baselines.
3. Provides a plug-and-play solution for faster diffusion-based video synthesis.

Abstract

Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs) -- despite making significant headway in this context -- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to…
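The core idea in the abstract, reusing cached computations for a number of steps that depends on how quickly the features are changing, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the MSE distance is assumed from Reviewer 02's question, the `CODEBOOK` thresholds and rates are invented for illustration (the reviews mention a codebook in Equation 5 but do not give its values), and `compute` is a hypothetical stand-in for an expensive DiT block.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical codebook: (distance upper bound, steps to hold the cache).
# These numbers are illustrative only; they are not taken from the paper.
CODEBOOK: List[Tuple[float, int]] = [(0.05, 4), (0.10, 2), (float("inf"), 1)]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def caching_rate(distance: float,
                 codebook: List[Tuple[float, int]] = CODEBOOK) -> int:
    """Map a feature distance to the number of steps until the next recompute."""
    for bound, rate in codebook:
        if distance <= bound:
            return rate
    return 1  # Fall back to recomputing at the very next step.


def adaptive_schedule(compute: Callable[[int], Sequence[float]],
                      num_steps: int,
                      codebook: List[Tuple[float, int]] = CODEBOOK):
    """Run `num_steps` steps, recomputing features only when the schedule says so.

    Returns the per-step features (fresh or cached) and the list of steps
    at which `compute` was actually called.
    """
    outputs, recomputed = [], []
    cached = None
    next_compute = 0
    for t in range(num_steps):
        if t == next_compute:
            fresh = compute(t)
            # Small change since the last recompute -> hold the cache longer.
            rate = 1 if cached is None else caching_rate(mse(fresh, cached), codebook)
            cached = fresh
            next_compute = t + rate
            recomputed.append(t)
        outputs.append(cached)
    return outputs, recomputed
```

In this sketch, a generation whose features never change (`compute = lambda t: [0.0, 0.0]`) is recomputed at only 4 of 10 steps, while one whose features change quickly (`compute = lambda t: [float(t), 0.0]`) is recomputed at every step, which mirrors the "not all videos are created equal" motivation.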

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. Novelty: The idea of adaptive caching seems novel in the field of diffusion model caching. 2. Motivation: The paper provides a clear motivation for the AdaCache method. 3. Clarity: The method is simple and easy to understand.

Weaknesses

1. The method section requires clarification: a. The paper lacks information about how the rate-of-change schedule hyperparameters were selected. b. Lines 286-287 state that the authors observe that unique caching schedules for each layer make the generations unstable. This important observation requires further explanation and clarification. 2. The experimental results require better presentation: a. There are concerns regarding the reported speedup and latency. Given that AdaCache is not a deterministi

Reviewer 02 · Rating 8 · Confidence 5

Strengths

1. Adaptive Caching achieves very good performance, even compared with the recent PAB paper. I appreciate this very much. 2. This approach requires no training and can be seamlessly integrated into a baseline video DiT at inference, as a plug-and-play component. 3. Motion Regularization (MoReg), which allocates computation based on the motion content in the video being generated, seems very reasonable.

Weaknesses

1. Regarding the choice of metric, why was the Mean Squared Error (MSE) selected directly? Can the MSE metric truly reflect the actual reduction in features between adjacent steps? Are there alternative metrics that might be more suitable, or can you provide comparisons with other metrics such as cosine similarity? 2. I'm interested in knowing whether the proposed method is compatible with large Text-to-Image (T2I) base models, like FLUX. If it is, what would be the ex

Reviewer 03 · Rating 5 · Confidence 5

Strengths

a. The approach presented is straightforward, and the method section is generally clear and easy to follow. b. The motivation for the work is reasonable and interesting, i.e., "not all videos are created equal". c. AdaCache provides a training-free acceleration method that can be applied to existing video diffusion models, achieving significant speedups without additional model training.

Weaknesses

1. Lines 285-287 mention that using unique caching schedules for each layer makes the generations unstable, but it’s unclear why this is the case. It would help if the authors provided an explanation. 2. Equation 5 introduces a codebook for the caching rate, but it’s not clear what this codebook is or how it’s created. The authors should add more details to clarify this part of the method. 3. While Table 1 shows AdaCache outperforming PAB, the qualitative comparison in Fig. 7 shows a different r

Taxonomy

Topics: Caching and Content Delivery · Advanced Data Storage Technologies · Cellular Automata and Applications

Methods: Softmax · Attention Is All You Need · Diffusion