Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis
Vasco Ramos, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes

TL;DR
This paper introduces a contrastive sequential diffusion method for multi-scene instructional video synthesis, ensuring visual consistency across non-linear scene sequences based on scene descriptions.
Contribution
It presents a novel contrastive learning approach that selects relevant previous scenes to guide the generation of subsequent scenes, improving multi-scene video coherence.
Findings
Enhanced scene consistency in generated videos
Outperforms previous methods on real-world, action-centered data
Effective handling of non-linear scene sequences
Abstract
Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often include non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent w.r.t. the scenes that require visual consistency. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
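The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper learns the selection contrastively, whereas this sketch assumes each scene description already has a fixed embedding vector and uses plain cosine similarity as a stand-in. The function names, the `min_similarity` parameter, and the toy vectors are all hypothetical.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_conditioning_scene(
    next_embedding: Sequence[float],
    previous_embeddings: Sequence[Sequence[float]],
    min_similarity: float = 0.0,  # hypothetical threshold, not from the paper
) -> Optional[int]:
    """Return the index of the earlier scene most similar to the next one.

    Returns None when no earlier scene scores above `min_similarity`,
    in which case the next scene would be generated without conditioning.
    """
    best_index, best_score = None, min_similarity
    for index, embedding in enumerate(previous_embeddings):
        score = cosine_similarity(next_embedding, embedding)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy example (assumed vectors): step 3 relates to step 1, not to step 2.
previous = [
    [1.0, 0.0, 0.0],  # step 1: "chop the onions"
    [0.0, 1.0, 0.0],  # step 2: "boil the water"
]
next_step = [0.9, 0.1, 0.2]  # step 3: "add the onions to the pan"
chosen = select_conditioning_scene(next_step, previous)
```

In this toy case `chosen` points at step 1 rather than the immediately preceding step 2, which is the non-linear dependency the abstract describes. In a full pipeline, the selected scene would then condition the denoising of the next scene; that part is omitted here.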
Taxonomy
Topics: Advanced Vision and Imaging · Music Technology and Sound Studies · Subtitles and Audiovisual Media
Methods: Diffusion
