TL;DR
This paper introduces Pyramid Forcing, a head-aware KVCache policy for autoregressive long video generation. It improves quality by classifying attention heads into types and applying a cache strategy tailored to each type.
Contribution
It identifies three distinct attention head types (Anchor, Wave, and Veil Heads) and proposes a head-aware pyramidal KVCache framework that improves long video synthesis quality.
Findings
Improved long-horizon generation quality on VBench-Long.
Raised the score of the Self Forcing baseline from 77.87 to 81.21.
Enhanced motion dynamics, visual fidelity, and semantic consistency.
Abstract
Autoregressive video generation enables streaming and open-ended long video synthesis, but still suffers from long-term degradation caused by accumulated errors. Existing KVCache strategies usually apply unified historical-frame retention, implicitly assuming homogeneous historical dependencies across attention heads. We revisit historical-frame attention and reveal three distinct head types: Anchor Heads require broad long-range context, Wave Heads exhibit periodic temporal dependencies, and Veil Heads focus on initial and adjacent frames. Based on this finding, we propose Pyramid Forcing, a head-aware pyramidal KVCache framework that identifies head types offline, assigns behavior-specific cache policies, and supports heterogeneous cache lengths via efficient ragged-cache attention. Experiments on Self Forcing and Causal Forcing show that Pyramid Forcing consistently improves…
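The head-aware idea described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names (`retained_frames`, `ragged_offsets`), the parameters (`budget`, `period`, `sinks`, `window`), and the exact retention rules are all assumptions chosen to mirror the three head behaviors named above. The point is only that each head type keeps a different set of historical frames, so cache lengths differ per head and need a ragged layout.

```python
from typing import List


def retained_frames(head_type: str, t: int, budget: int = 8,
                    period: int = 4, sinks: int = 1, window: int = 2) -> List[int]:
    """Return the history frame indices (0..t-1) one head keeps for frame t.

    All rules below are illustrative assumptions, not the paper's policies.
    """
    if budget < 2:
        raise ValueError("budget must be at least 2")
    if head_type == "anchor":
        # Broad long-range context: evenly spaced frames over the whole history.
        if t <= budget:
            keep = set(range(t))
        else:
            keep = {round(i * (t - 1) / (budget - 1)) for i in range(budget)}
    elif head_type == "wave":
        # Periodic dependencies: frames a multiple of `period` steps back,
        # plus the most recent frame.
        keep = {f for f in range(t) if (t - f) % period == 0}
        keep |= set(range(max(0, t - 1), t))
    elif head_type == "veil":
        # Initial and adjacent frames: the first `sinks` and last `window` frames.
        keep = set(range(min(sinks, t))) | set(range(max(0, t - window), t))
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(keep)


def ragged_offsets(lengths: List[int]) -> List[int]:
    """Prefix sums marking where each head's entries start in one flat buffer."""
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets


# Hypothetical offline assignment of head types for three heads.
head_types = ["anchor", "wave", "veil"]
caches = [retained_frames(h, t=12) for h in head_types]
offsets = ragged_offsets([len(c) for c in caches])
```

In this sketch the three heads end up with caches of different lengths, and `offsets` records the boundaries needed to pack them into a single flat buffer, which is one common way to feed variable-length sequences to an attention kernel.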
