Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis

Vasco Ramos; Yonatan Bitton; Michal Yarom; Idan Szpektor; Joao Magalhaes

arXiv:2407.11814 · cs.CV · December 10, 2024

TL;DR

This paper introduces a contrastive sequential diffusion method for multi-scene instructional video synthesis. Each scene is generated from its scene description, and the method improves visual consistency across non-linear scene sequences.

Contribution

The paper presents a contrastive learning approach that selects the most suitable previously generated scene to guide and condition the denoising of the next scene, improving multi-scene video coherence.

Findings

01 Improved scene consistency in generated videos

02 Improved consistency over previous methods on real-world, action-centric data

03 Effective handling of non-linear scene sequences

Abstract

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often include non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent w.r.t. the scenes that require visual consistency. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
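The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the paper learns the selection contrastively, whereas this example swaps in plain cosine similarity over toy embeddings. The function names `cosine` and `select_conditioning_scene`, and the embedding vectors, are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_conditioning_scene(step_embedding, previous_scene_embeddings):
    """Pick the earlier scene that best matches the current step.

    Assumption: a fixed cosine score stands in for the learned
    contrastive selector described in the paper.
    Returns (index, score), or (None, None) when no earlier scene exists.
    """
    if not previous_scene_embeddings:
        return None, None
    scores = [cosine(step_embedding, e) for e in previous_scene_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy, made-up embeddings for three already generated scenes.
previous = [
    [1.0, 0.0, 0.0],  # scene 0
    [0.0, 1.0, 0.0],  # scene 1
    [0.6, 0.8, 0.0],  # scene 2 (the immediately preceding scene)
]
current = [0.9, 0.1, 0.0]  # embedding of the next step's description

index, score = select_conditioning_scene(current, previous)
print(index)  # the scene that would condition the next denoising pass
```

In this toy case the selected scene is scene 0 rather than the immediately preceding scene 2, which is the non-linear pattern the abstract refers to. In a full pipeline the selected scene would then condition the denoising of the next scene; that part is omitted here.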

Code & Models

Repositories

novasearch/cosed (PyTorch, official)

Taxonomy

Topics: Advanced Vision and Imaging · Music Technology and Sound Studies · Subtitles and Audiovisual Media

Methods: Diffusion