TL;DR
DisCa introduces a learnable feature caching mechanism for video diffusion transformers. It pairs a lightweight neural predictor with a stable distillation approach to significantly accelerate generation while maintaining quality.
Contribution
The paper proposes the first distillation-compatible learnable feature caching method for video diffusion models, enhancing acceleration and stability.
Findings
Achieves 11.8× acceleration in video diffusion generation.
Maintains high-quality video generation with the new caching method.
Demonstrates effectiveness through extensive experiments.
Abstract
While diffusion models have achieved great success in video generation, this progress is accompanied by a rapidly escalating computational burden. Among existing acceleration methods, feature caching is popular because it is training-free and delivers considerable speedup, but it inevitably loses semantics and detail as compression increases. Another widely adopted method, training-aware step distillation, has been successful in image generation but also degrades drastically in video generation when only a few steps are used. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, because the sampling steps are sparser. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional…
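To make the contrast in the abstract concrete, here is a minimal toy sketch of the difference between reusing a cached feature as-is (training-free caching) and passing it through a small fitted predictor (learnable caching). It is not the DisCa implementation: `expensive_block`, `sample`, `fit_slope`, the scalar "feature", the one-parameter linear predictor, and the least-squares fit are all hypothetical stand-ins chosen only for illustration.

```python
import math

def expensive_block(t):
    """Hypothetical stand-in for a costly transformer block at timestep t."""
    return math.exp(-2.0 * t)

def sample(steps, interval, predictor=None):
    """Toy sampling loop that fully computes the block every `interval` steps.

    On skipped steps the cached feature is reused unchanged (training-free
    caching) or mapped through `predictor(cached, gap)` (learnable caching).
    Returns (features, number_of_full_computations).
    """
    feats, full_calls = [], 0
    cached, cached_step = None, 0
    for i in range(steps):
        if i % interval == 0:
            cached, cached_step = expensive_block(i / steps), i
            full_calls += 1
            feats.append(cached)
        else:
            gap = (i - cached_step) / steps
            feats.append(cached if predictor is None else predictor(cached, gap))
    return feats, full_calls

def fit_slope(steps, interval):
    """Least-squares fit of w in `cached + w * gap` against full computation.

    This closed-form fit is an assumed stand-in for training a predictor;
    it is not the training procedure described in the paper.
    """
    num = den = 0.0
    cached, cached_step = None, 0
    for i in range(steps):
        target = expensive_block(i / steps)
        if i % interval == 0:
            cached, cached_step = target, i
        else:
            gap = (i - cached_step) / steps
            num += gap * (target - cached)
            den += gap * gap
    return num / den

def mse(a, b):
    """Mean squared error between two equally long feature lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

STEPS, INTERVAL = 20, 4
reference, ref_calls = sample(STEPS, 1)          # no caching
reused, reuse_calls = sample(STEPS, INTERVAL)    # training-free reuse
w = fit_slope(STEPS, INTERVAL)
predicted, pred_calls = sample(STEPS, INTERVAL, lambda c, g: c + w * g)

print("full computations:", ref_calls, reuse_calls, pred_calls)
print("reuse error:", mse(reused, reference))
print("predictor error:", mse(predicted, reference))
```

In this toy setting both cached runs perform the same number of full computations, and the fitted predictor cannot do worse than plain reuse on the steps it was fitted to, because plain reuse corresponds to the special case `w = 0`. How well a real neural predictor tracks video diffusion features is an empirical question that this sketch does not address.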
