TL;DR
DiCache is a novel, training-free, adaptive caching strategy for diffusion models that dynamically determines when and how to cache, significantly improving efficiency and fidelity across various models by leveraging real-time feature trajectory analysis.
Contribution
The paper introduces DiCache, a new runtime adaptive caching method that uses online probing and feature trajectory alignment, overcoming limitations of prior fixed or dataset-dependent caching strategies.
Findings
Achieves higher efficiency than state-of-the-art methods.
Improves visual fidelity in diffusion model outputs.
Demonstrates effectiveness across multiple diffusion models.
Abstract
Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache,…
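The "when to cache" idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a model expressed as a list of callable blocks, a probe made of the first `m` blocks, a relative L1 change as the error indicator, and a threshold `delta`; the helper names `relative_l1` and `run_with_probe_cache` are hypothetical. The trajectory-alignment part ("how to use cache") is not modeled here: the sketch simply reuses the last deep-block residual.

```python
from typing import Callable, List, Sequence, Tuple

Feature = List[float]
Block = Callable[[Feature], Feature]


def relative_l1(curr: Feature, prev: Feature) -> float:
    """Relative L1 change between two feature vectors (assumed indicator)."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) + 1e-8  # avoid division by zero
    return num / den


def run_with_probe_cache(
    blocks: Sequence[Block],
    inputs: Sequence[Feature],
    m: int,
    delta: float,
) -> Tuple[List[Feature], int]:
    """Process one input per timestep; skip the deep blocks while the
    accumulated change of the shallow probe stays below `delta`.

    Returns the per-timestep outputs and the number of full passes.
    """
    outputs: List[Feature] = []
    full_passes = 0
    prev_probe = None        # probe feature from the previous timestep
    cached_residual = None   # (output - probe) from the last full pass
    accumulated = 0.0        # probe change accumulated since the last full pass

    for x in inputs:
        # The shallow probe (first m blocks) runs at every timestep.
        h = x
        for block in blocks[:m]:
            h = block(h)
        probe = h

        if prev_probe is not None:
            accumulated += relative_l1(probe, prev_probe)
        prev_probe = probe

        if cached_residual is not None and accumulated < delta:
            # Reuse: approximate the deep blocks with the cached residual.
            out = [p + r for p, r in zip(probe, cached_residual)]
        else:
            # Recompute: run the remaining blocks and refresh the cache.
            for block in blocks[m:]:
                h = block(h)
            out = h
            cached_residual = [o - p for o, p in zip(out, probe)]
            accumulated = 0.0
            full_passes += 1
        outputs.append(out)

    return outputs, full_passes
```

In this sketch a larger `delta` tolerates more accumulated probe change before recomputing, so it trades fidelity for speed; the probe still runs at every timestep, which is the overhead one of the reviews below asks about.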
Peer Reviews
Decision·ICLR 2026 Poster
* The motivation behind the proposed method is well articulated. The method's design is strongly supported by empirical evidence presented in the paper.
* The method is completely training-free, making it highly practical and broadly applicable across different diffusion models.

* The proposed reuse threshold δ appears to require manual, per-model tuning (δ = 0.2 for WAN 2.1, δ = 0.1 for HunyuanVideo, δ = 0.4 for Flux), which may reduce generality and increase tuning effort for new architectures.
* While the probe is “shallow,” it is still computed at every single timestep to accumulate caching error. It remains unclear how much speedup is offset by repeated probing on large backbones.
Novel idea on dynamic cache trajectory alignment that effectively and efficiently adapts the cached value to the current layer that reuses it. The method is training-free and plug-and-play, requiring no model fine-tuning; it works at inference by wrapping around any DiT model. DiCache consistently achieves faster inference without sacrificing output quality, outperforming prior caching methods on both image and video diffusion models. Clear analysis and ablations with generally smooth writing and informati
1. **Reliance on Threshold Hyperparameter:** Although DiCache demonstrates effective runtime caching under the reported experimental settings, the chosen probe depth (m) and accumulated caching error threshold (δ) should ideally generalize across different models. Alternatively, the authors could justify that the method’s effectiveness is not sensitive to these hyperparameters to substantiate the “calibration-free” claim. However, the current analysis of both hyperparameters lacks evidence of su
1. The paper addresses the two fundamental challenges of cache-based acceleration through a unified probe-driven framework, which reduces reliance on empirical heuristics and offline calibration.
2. DiCache can be further combined with Sparse VideoGen to achieve additional acceleration, demonstrating its complementarity with sparse attention techniques.
3. The authors empirically observe a strong correlation between shallow-layer feature differences and deep-layer residuals, and find that feat
1. The coverage of baselines is somewhat limited. Although the experimental tables include TeaCache, EasyCache, TaylorSeer, and ToCa, the Related Work section also discusses other comparable methods such as FasterCache, FORA, and Δ-DiT. Incorporating these methods into the quantitative comparison tables would make the empirical positioning of the proposed approach more complete.
2. The paper primarily evaluates performance using similarity metrics (LPIPS, SSIM, and PSNR) with respect to the out
