TL;DR
FastCache significantly accelerates diffusion transformer inference by using learnable linear approximation, token selection, and caching strategies to reduce computation while maintaining high generation quality.
Contribution
It introduces a novel caching and compression framework with learnable approximation for diffusion transformers, improving speed and efficiency.
Findings
FastCache demonstrates substantial latency and memory reductions across multiple DiT variants.
FastCache achieves the best generation quality among existing cache methods, measured by FID and t-FID.
Theoretical analysis confirms bounded approximation error under certain decision rules.
Abstract
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden-state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes fall below a predefined threshold. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a…
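The dual strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the saliency definition (squared distance to the same token at the previous timestep), the relative-change test, and both threshold values are assumptions made for illustration, since the abstract does not specify them. A plain linear map stands in for the learnable linear approximation.

```python
import math


def relative_change(curr, prev):
    """Relative Frobenius-norm change between two hidden states.

    Each hidden state is a list of token vectors (lists of floats).
    """
    num = sum((c - p) ** 2 for cr, pr in zip(curr, prev) for c, p in zip(cr, pr))
    den = sum(p ** 2 for pr in prev for p in pr)
    return math.sqrt(num / den) if den > 0 else float("inf")


def token_saliency(curr, prev):
    """Assumed saliency: squared L2 distance of each token to its previous value."""
    return [sum((c - p) ** 2 for c, p in zip(cr, pr)) for cr, pr in zip(curr, prev)]


def linear_approx(tokens, weight, bias):
    """Cheap stand-in for a transformer block: y = W x + b, applied per token."""
    return [
        [sum(w * v for w, v in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def cached_block(x, prev_x, block, weight, bias, block_tau=0.05, token_tau=1e-3):
    """Run one block with hypothetical block-level and token-level shortcuts.

    Returns (output, mode), where mode is "full", "approximated", or "partial".
    """
    # First timestep: nothing to compare against, so run the full block.
    if prev_x is None:
        return block(x), "full"

    # Block-level cache: if the hidden state barely changed since the previous
    # timestep, skip the block and use the linear approximation for all tokens.
    if relative_change(x, prev_x) < block_tau:
        return linear_approx(x, weight, bias), "approximated"

    # Token-level selection: only salient tokens go through the full block;
    # the remaining tokens keep the linear approximation.
    saliency = token_saliency(x, prev_x)
    salient = [i for i, s in enumerate(saliency) if s > token_tau]
    out = linear_approx(x, weight, bias)
    if salient:
        full = block([x[i] for i in salient])
        for i, row in zip(salient, full):
            out[i] = row
    return out, "partial"
```

In this sketch `block` is any callable that maps a list of token vectors to a list of token vectors. In a real DiT the weight and bias would be learned per block, and the thresholds would come from whatever decision rule the paper's analysis assumes.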
