Multi-sentence Video Grounding for Long Video Generation

Wei Feng; Xin Wang; Hong Chen; Zeyang Zhang; Wenwu Zhu

arXiv:2407.13219 · cs.CV · July 19, 2024

TL;DR

This paper introduces a novel approach for long video generation by integrating multi-sentence video grounding with video editing, enabling temporally consistent long videos with reduced memory costs.

Contribution

It pioneers connecting video grounding with long video generation, utilizing scene prompts and video editing to improve consistency and efficiency.

Findings

01 Effective long video generation with maintained temporal consistency.

02 Reduced memory cost through segment-wise editing.

03 Enhanced subject consistency with morphing and personalization.
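
The second finding above can be illustrated with a minimal sketch. It is not the paper's implementation: `edit_segment` is a hypothetical placeholder for a real video-editing model, and frames are represented as plain strings. The point it shows is only that when a long video is edited one segment at a time, the editor never holds more than `segment_len` frames at once, whatever the total length.

```python
def iter_segments(num_frames, segment_len):
    """Yield (start, end) index ranges that cover num_frames in order."""
    for start in range(0, num_frames, segment_len):
        yield start, min(start + segment_len, num_frames)


def edit_segment(frames):
    # Hypothetical stand-in for a video-editing model; it only tags each frame.
    return [f"edited:{frame}" for frame in frames]


def edit_long_video(frames, segment_len):
    """Edit a long frame sequence segment by segment.

    Returns the edited frames and the largest number of frames that were
    passed to the editor in a single call.
    """
    edited = []
    peak_frames = 0
    for start, end in iter_segments(len(frames), segment_len):
        chunk = frames[start:end]
        peak_frames = max(peak_frames, len(chunk))
        edited.extend(edit_segment(chunk))
    return edited, peak_frames


frames = [f"frame{i}" for i in range(10)]
edited, peak_frames = edit_long_video(frames, segment_len=4)
```

In this toy setup the editor's working set is bounded by `segment_len` rather than by the video length; how segments are kept consistent with each other is a separate concern that the sketch does not address.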

Abstract

Video generation has witnessed great success recently, but its application to generating long videos remains challenging due to the difficulty of maintaining temporal consistency in the generated videos and the high memory cost of generation. To tackle these problems, we propose Multi-sentence Video Grounding for Long Video Generation, a new idea that connects massive video moment retrieval to the video generation task for the first time and provides a new paradigm for long video generation. Our method can be summarized in three steps: (i) We design sequential scene text prompts as queries for video grounding, using massive video moment retrieval to search a video database for moment segments that meet the text requirements. (ii) Based on the source frames of the retrieved video moment segments, we adopt video editing…

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Multimodal Machine Learning Applications