Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
Xin Yan, Yuxuan Cai, Qiuyue Wang, Yuan Zhou, Wenhao Huang, Huan Yang

TL;DR
Presto is a new long video diffusion model that uses segmented cross-attention and a large, annotated dataset to generate coherent, content-rich 15-second videos with improved semantic and dynamic quality.
Contribution
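We introduce Segmented Cross-Attention (SCA), a parameter-free strategy for better long-range coherence, and build LongTake-HD, a dataset of 261k content-rich videos, each annotated with an overall caption and five progressive sub-captions, advancing long video generation capabilities.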
Findings
Presto achieves 78.5% on VBench Semantic Score.
Presto attains 100% on Dynamic Degree.
Presto outperforms existing state-of-the-art methods.
Abstract
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic…
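The segment-to-sub-caption routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: one hidden vector per frame, equal-length temporal segments, a single attention head, and no learned query/key/value projections (in a real DiT block the existing cross-attention weights would presumably be reused for every segment, which is consistent with SCA adding no parameters). The names `cross_attend` and `segmented_cross_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product attention.

    Assumption for this sketch: the context tokens serve as both keys and
    values, with no learned projections.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def segmented_cross_attention(frames, sub_captions):
    """Split frames along time and let segment i attend only to sub-caption i.

    frames:       list of T hidden vectors, ordered in time
    sub_captions: list of S token-embedding lists, one per sub-caption
    """
    num_segments = len(sub_captions)
    if num_segments == 0 or len(frames) % num_segments != 0:
        # Equal-length segments are an assumption of this sketch.
        raise ValueError("frame count must be a positive multiple of the caption count")
    seg_len = len(frames) // num_segments

    outputs = []
    for i, caption_tokens in enumerate(sub_captions):
        segment = frames[i * seg_len:(i + 1) * seg_len]
        outputs.extend(cross_attend(segment, caption_tokens))
    return outputs


if __name__ == "__main__":
    # Ten frames, five sub-captions (matching the five progressive
    # sub-captions per video in LongTake-HD), two frames per segment.
    frames = [[1.0, 0.0] for _ in range(10)]
    sub_captions = [[[float(i), float(i)]] for i in range(5)]
    result = segmented_cross_attention(frames, sub_captions)
    print(len(result), result[0], result[-1])
```

Because each segment sees only its own sub-caption, the text conditioning changes along the temporal axis while the attention computation itself is unchanged.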
Taxonomy
Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Video Coding and Compression Technologies
Methods: Diffusion · Semantic Cross Attention
