Flow caching for autoregressive video generation
Yuexiao Ma, Xuzhe Zheng, Jing Xu, Xiwei Xu, Feng Ling, Xiawu Zheng, Huafeng Kuang, Huixia Li, Xing Wang, Xuefeng Xiao, Fei Chao, Rongrong Ji

TL;DR
FlowCache introduces a chunkwise caching framework for autoregressive video generation, significantly accelerating the process while maintaining high quality, enabling real-time ultra-long video synthesis.
Contribution
This paper presents the first caching framework tailored for autoregressive video models, with dynamic chunkwise policies and optimized cache compression.
Findings
Achieves 2.38x speedup on MAGI-1
Achieves 6.7x speedup on SkyReels-V2
Maintains high generation quality with negligible degradation
Abstract
Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames, an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that…
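The per-chunk policy described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact reuse criterion, so everything below is an assumption made for illustration: each chunk accumulates a relative L1 distance between its consecutive inputs and recomputes once a hypothetical `threshold` is reached. The names `ChunkCache`, `step_chunk`, and `compute_fn` are invented here and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


def relative_l1(current: Sequence[float], previous: Sequence[float],
                eps: float = 1e-8) -> float:
    """Relative L1 distance: ||current - previous||_1 / (||previous||_1 + eps)."""
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) + eps
    return num / den


@dataclass
class ChunkCache:
    """State kept separately for every chunk (hypothetical layout)."""
    prev_input: Optional[List[float]] = None
    cached_output: Optional[List[float]] = None
    accumulated: float = 0.0  # relative L1 drift since the last recompute


def step_chunk(cache: ChunkCache,
               chunk_input: Sequence[float],
               compute_fn: Callable[[Sequence[float]], List[float]],
               threshold: float) -> Tuple[List[float], bool]:
    """Process one chunk at one denoising step; returns (output, recomputed)."""
    if cache.prev_input is not None and cache.cached_output is not None:
        # Assumed rule: accumulate drift and reuse while it stays below threshold.
        cache.accumulated += relative_l1(chunk_input, cache.prev_input)
        cache.prev_input = list(chunk_input)
        if cache.accumulated < threshold:
            return cache.cached_output, False
    # No cache yet, or the drift is too large: run the expensive model call.
    output = compute_fn(chunk_input)
    cache.prev_input = list(chunk_input)
    cache.cached_output = output
    cache.accumulated = 0.0
    return output, True


if __name__ == "__main__":
    # Stand-in for the denoising network; purely illustrative.
    toy_model = lambda x: [2.0 * v for v in x]
    caches = [ChunkCache(), ChunkCache()]
    steps = [
        ([1.0, 1.0], [1.0, 1.0]),    # step 1: both chunks start cold
        ([1.01, 1.01], [2.0, 2.0]),  # step 2: chunk 0 barely moves, chunk 1 jumps
    ]
    for inputs in steps:
        flags = [step_chunk(c, x, toy_model, threshold=0.1)[1]
                 for c, x in zip(caches, inputs)]
        print(flags)
```

Because every chunk owns its `ChunkCache`, two chunks at the same timestep can make different decisions, which is the behavior a single global policy cannot express.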
Peer Reviews
Decision: ICLR 2026 Poster
- The dynamic assignment of compute or reuse significantly improves the efficiency of the model, i.e., by more than 2x compared to TeaCache-fast.
- The proposed KV cache compression method balances both the visual importance and the redundancy of past tokens. It is specifically designed for the characteristics of video tokens.
- The idea of reusing or recomputing based on the L1 similarity measure L1rel is clearly stated and proved, but the details of how the reuse-or-recompute decision is made are not clear. Is there a threshold or decision-making module for this part?
- It is not clear why the proposed method outperforms the baseline, TeaCache, in terms of both speed and video quality. The reuse operation improves speed, but why does it also achieve better frame quality? It would be better to have more detailed comp
- The paper proposes a training-free, plug-in acceleration method for causal video diffusion. Treating each video chunk as its own unit, with an independent reuse policy, is well motivated by the observed heterogeneity across chunks at the same timesteps.
- The experiments isolate the benefit of chunkwise feature reuse over full reuse and show that KV compression has a small impact.
- Memory claims lack evidence. Memory usage is stated to be fixed, but no memory benchmark is presented in the paper.
- Lack of quality comparison. Only a few images are displayed in the paper. No video clip was provided, making it hard to evaluate the visual quality.
- Lack of evaluation on a long-video benchmark, e.g., VBench-long, even though the method is claimed to be helpful for long-video generation.
- MAGI-1 already applies window attention (8-second preceding video content). Weakening
1. The paper's primary strength is its clear identification and empirical validation of *why* existing caching methods fail for autoregressive models: the "heterogeneous denoising states." This is a sharp and important insight.
2. The "chunkwise caching" policy is a direct, logical, and highly effective solution to the problem identified. The ablation study decisively proves that this chunk-specific strategy is the main reason for the method's success in preserving quality.
3. Additionally, the
1. There is an internal contradiction between the paper's theory and its empirical results. **Theorem 1** (and its proof in Appendix B) is used to establish that the relative L1 distance is a *monotonically decreasing* function of time (as $t$ goes from 0 to T). However, the paper's *own empirical plot* (Figure 2) and *main text* (e.g., "relative L1 distance monotonically increases as denoising progresses" in line 302) show the exact opposite. This contradiction undermines the stated theoretical
Taxonomy
Topics: Video Coding and Compression Technologies · Image and Video Quality Assessment · Caching and Content Delivery
