FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion

Zhuokun Chen; Jianfei Cai; Bohan Zhuang

arXiv:2602.05305 · cs.CV · February 9, 2026

Open Access

TL;DR

FlashBlock caches attention outputs from tokens outside the current block, which stay largely stable across diffusion steps in block diffusion. Reusing them cuts the cost of attending over a growing KV cache in long-context generation, with negligible impact on quality.

Contribution

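The paper proposes FlashBlock, a cached block-external attention mechanism: attention outputs from tokens outside the current block are reused across diffusion steps, while block-internal attention, which varies significantly, is still recomputed. This reduces attention cost in long-context block diffusion models.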

Findings

01. Achieves up to 1.44× higher token throughput.

02. Reduces attention computation time by up to 1.6×.

03. Maintains negligible impact on generation quality.

Abstract

Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention…
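The split between block-external and block-internal attention described in the abstract can be illustrated with a small sketch. The abstract does not say how the cached external output is combined with the freshly computed internal attention, so the log-sum-exp merge below is an assumption borrowed from standard blockwise attention, not the paper's confirmed method. The function names and the `ext_cache` layout are hypothetical.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over a subset of keys/values.

    Returns the normalized output and the log-sum-exp (LSE) of the scaled
    scores, which is what makes two partial results mergeable later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, top + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into attention over their union."""
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    return [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(out_a, out_b)]

def cached_block_step(q, k_int, v_int, ext_cache):
    """One denoising step: fresh block-internal attention, cached external part.

    ext_cache is the (output, lse) pair computed over the external KV cache
    at an earlier step (hypothetical layout, assumed for this sketch).
    """
    out_int, lse_int = partial_attention(q, k_int, v_int)
    out_ext, lse_ext = ext_cache
    return merge(out_ext, lse_ext, out_int, lse_int)
```

If `ext_cache` was computed with the same query, the merged result equals full attention over the concatenated keys. Across diffusion steps the query changes, so reusing the cache is an approximation whose accuracy rests on the stability of block-external attention that the abstract reports.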

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Multimodal Machine Learning Applications