TL;DR
This paper introduces Hybrid Forcing, which combines lightweight linear temporal attention and block-sparse attention with decoupled distillation to enable real-time, long-horizon streaming video generation with state-of-the-art quality.
Contribution
The paper proposes a hybrid attention design (linear temporal attention plus block-sparse attention inside the sliding window) and a tailored distillation strategy to improve long-range dependency modeling and computational efficiency in streaming video generation.
Findings
Achieves real-time 832x480 video at 29.5 FPS on a single GPU.
Outperforms existing methods on short- and long-form video benchmarks.
Preserves long-range dependencies beyond the sliding window with negligible memory and computational overhead.
Abstract
Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within…
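The "compact key-value state" that absorbs evicted tokens can be illustrated with the generic linear-attention recurrence. The sketch below is not the paper's implementation: the class name `EvictedTokenState`, the `elu(x) + 1` feature map, and the normalization are all assumptions, chosen only to show how a fixed-size state can summarize tokens that have left the sliding window.

```python
import math


def feature_map(x):
    # elu(x) + 1, a positive feature map often used in linear attention.
    # Assumption: the abstract does not say which feature map is used.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


class EvictedTokenState:
    """Fixed-size summary of tokens evicted from the window (hypothetical name)."""

    def __init__(self, d_k, d_v):
        # S accumulates phi(k) v^T; z accumulates phi(k). Neither grows with history.
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def absorb(self, k, v):
        # Fold one evicted (key, value) pair into the running state.
        phi_k = feature_map(k)
        for i, pk in enumerate(phi_k):
            self.z[i] += pk
            row = self.S[i]
            for j, vj in enumerate(v):
                row[j] += pk * vj

    def read(self, q):
        # Attend over everything absorbed so far in O(d_k * d_v) time.
        phi_q = feature_map(q)
        d_v = len(self.S[0])
        denom = sum(pq * zi for pq, zi in zip(phi_q, self.z))
        if denom == 0.0:
            return [0.0] * d_v  # nothing has been absorbed yet
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(d_v)
        ]


state = EvictedTokenState(d_k=2, d_v=2)
state.absorb(k=[0.0, 0.0], v=[1.0, 0.0])
state.absorb(k=[0.0, 0.0], v=[0.0, 1.0])
out = state.read(q=[0.0, 0.0])  # equal keys, so the values are averaged
```

Under these assumptions the state stays at `d_k * d_v + d_k` numbers no matter how many tokens are absorbed, which is the sense in which the memory and compute overhead can remain small as the video grows.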
