FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
Zhuokun Chen, Jianfei Cai, Bohan Zhuang

TL;DR
FlashBlock caches attention outputs computed over tokens outside the current block, which stay largely stable across diffusion steps. Reusing them cuts attention cost in long-context block diffusion with negligible impact on generation quality.
Contribution
The paper proposes FlashBlock, an attention caching method that reuses block-external attention outputs, which remain stable across diffusion steps, to make long-context block diffusion models more efficient.
Findings
Achieves up to 1.44× higher token throughput.
Reduces attention computation time by up to 1.6×.
Has negligible impact on generation quality.
Abstract
Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention…
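The split between block-external and block-internal attention described in the abstract can be illustrated with a standard identity: softmax attention over two disjoint key sets can be computed separately and merged exactly, as long as each part also stores the log-sum-exp of its scores. The sketch below is a minimal pure-Python illustration under that assumption, not the paper's implementation. The function names (`partial_attention`, `merge_partials`), the toy data, and the reuse-at-the-next-step schedule are all hypothetical.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over a subset of keys/values.

    Returns the normalized output and the log-sum-exp (LSE) of the scores,
    which is what makes two partial results mergeable later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(len(values[0]))]
    return out, m + math.log(total)

def merge_partials(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into attention over their union."""
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]

# Toy data (hypothetical): two cached tokens outside the block, two inside it.
ext_k = [[1.0, 0.0], [0.0, 1.0]]
ext_v = [[1.0, 2.0], [3.0, 4.0]]
in_k = [[0.5, 0.5], [1.0, 1.0]]
in_v = [[5.0, 6.0], [7.0, 8.0]]
q = [0.3, -0.2]

# Step t: compute the block-external part once and keep it.
cached_out, cached_lse = partial_attention(q, ext_k, ext_v)

# Same step: merging with the block-internal part is exact.
in_out, in_lse = partial_attention(q, in_k, in_v)
merged = merge_partials(cached_out, cached_lse, in_out, in_lse)
full, _ = partial_attention(q, ext_k + in_k, ext_v + in_v)

# Step t+1 (assumed schedule): the query and block-internal keys drift a little.
# Reusing the cached external part is now an approximation, not an identity.
q_next = [0.32, -0.18]
in_k_next = [[0.55, 0.45], [0.95, 1.05]]
in_out_next, in_lse_next = partial_attention(q_next, in_k_next, in_v)
approx_next = merge_partials(cached_out, cached_lse, in_out_next, in_lse_next)
exact_next, _ = partial_attention(q_next, ext_k + in_k_next, ext_v + in_v)
max_err = max(abs(a - b) for a, b in zip(approx_next, exact_next))
```

In this sketch, `merged` matches `full` up to floating-point error, while `max_err` measures what is lost by reusing the external part after the query has changed. The saving comes from skipping the pass over the external keys, whose count grows with context length, at steps where the cached result is reused.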
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Multimodal Machine Learning Applications
