Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

Xin Yan; Yuxuan Cai; Qiuyue Wang; Yuan Zhou; Wenhao Huang; Huan Yang

arXiv:2412.01316·cs.CV·April 1, 2025

TL;DR

Presto is a new long video diffusion model that uses segmented cross-attention and a large, annotated dataset to generate coherent, content-rich 15-second videos with improved semantic and dynamic quality.

Contribution

We introduce Segmented Cross-Attention (SCA) to improve long-range coherence and build LongTake-HD, a dataset of 261k content-rich videos with detailed annotations, for long video generation.

Findings

01. Presto achieves 78.5% on the VBench Semantic Score.

02. Presto attains 100% on Dynamic Degree.

03. Presto outperforms existing state-of-the-art methods.

Abstract

We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits the hidden states into segments along the temporal dimension so that each segment cross-attends to its corresponding sub-caption. SCA requires no additional parameters, so it can be incorporated seamlessly into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, which consists of 261k content-rich videos with scenario coherence, each annotated with an overall video caption and five progressive sub-captions. Experiments show that Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic…
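The segment-to-sub-caption routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_segments`, `attend`, `segmented_cross_attention`) are hypothetical, each frame is reduced to a single query vector, the split is assumed to be contiguous and non-overlapping, and the query/key/value projections of a real DiT cross-attention layer are omitted (the sketch reuses the inputs directly, which mirrors only the point that the segmentation itself adds no parameters).

```python
import math

def split_segments(num_frames, num_segments):
    """Partition frame indices into contiguous (start, end) ranges.

    Assumption: segments do not overlap; any remainder frames go to
    the earliest segments.
    """
    base, extra = divmod(num_frames, num_segments)
    bounds, start = [], 0
    for i in range(num_segments):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def segmented_cross_attention(frames, sub_captions):
    """Let each temporal segment attend only to its own sub-caption.

    frames: list of T query vectors (one per frame, a simplification).
    sub_captions: list of K sub-captions, each a list of token embeddings.
    """
    bounds = split_segments(len(frames), len(sub_captions))
    out = []
    for (start, end), caption in zip(bounds, sub_captions):
        for t in range(start, end):
            # Keys and values both come from this segment's sub-caption.
            out.append(attend(frames[t], caption, caption))
    return out
```

With five sub-captions per video, as in LongTake-HD, `split_segments(num_frames, 5)` would give the five frame ranges, and frames in the first range would see only the first sub-caption's tokens, and so on.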

Taxonomy

Topics: Video Analysis and Summarization · Advanced Data Compression Techniques · Video Coding and Compression Technologies

Methods: Diffusion · Semantic Cross Attention