TL;DR
This paper introduces FoCa, an ODE-based feature-caching method for diffusion transformers that speeds up inference by 4.53x to 6.45x without additional training while maintaining high generation quality across image synthesis, video generation, and super-resolution.
Contribution
It models feature caching as a feature-ODE solving problem, enabling robust acceleration of diffusion transformers without additional training.
Findings
Achieves near-lossless speedups of 5.50x on FLUX and 6.45x on HunyuanVideo.
Maintains high quality with a 4.53x speedup on DiT.
Demonstrates effectiveness across image synthesis, video generation, and super-resolution tasks.
Abstract
Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats…
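To make the caching idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm and does not implement FoCa's calibration step. It only contrasts plain reuse of the last cached feature with a simple linear forecast from the last two cached features when full computation happens once every few timesteps. The names `true_feature` and `run_cached`, the two-component "feature", and all parameter values are assumptions made for illustration.

```python
import math


def true_feature(t):
    """Hypothetical stand-in for an expensive transformer-block output at time t."""
    return [math.sin(t), math.cos(2.0 * t)]


def run_cached(strategy, num_steps=60, interval=4, dt=0.05):
    """Walk a timestep grid, fully computing the feature every `interval` steps.

    strategy: "reuse" copies the last computed feature;
              "forecast" linearly extrapolates from the last two computed features.
    Returns (mean max-abs error versus the exact feature, number of full computations).
    """
    cache = []  # up to two (time, feature) pairs from full computations
    total_err = 0.0
    full_calls = 0
    for k in range(num_steps):
        t = k * dt
        # Evaluated at every step only to measure error in this toy;
        # a real pipeline would skip this call on cached steps.
        exact = true_feature(t)
        if k % interval == 0:
            feat = exact
            cache = (cache + [(t, feat)])[-2:]
            full_calls += 1
        elif strategy == "reuse" or len(cache) < 2:
            # Zeroth-order hold: reuse the most recent computed feature.
            feat = cache[-1][1]
        else:
            # First-order forecast: extend the line through the last two cached points.
            (t0, f0), (t1, f1) = cache
            feat = [b + (b - a) / (t1 - t0) * (t - t1) for a, b in zip(f0, f1)]
        total_err += max(abs(a - b) for a, b in zip(feat, exact))
    return total_err / num_steps, full_calls


reuse_err, reuse_calls = run_cached("reuse")
forecast_err, forecast_calls = run_cached("forecast")
print(f"reuse:    mean error {reuse_err:.4f} with {reuse_calls} full computations")
print(f"forecast: mean error {forecast_err:.4f} with {forecast_calls} full computations")
```

Both strategies perform the same number of full computations, so any difference in error comes from how the cached history is used. Raising `interval` in this toy lets one see how the error of each strategy grows with the skipping interval, which is the regime the abstract identifies as problematic for existing caching methods.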