SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text

Haohe Liu; Gael Le Lan; Xinhao Mei; Zhaoheng Ni; Anurag Kumar; Varun Nagaraja; Wenwu Wang; Mark D. Plumbley; Yangyang Shi; Vikas Chandra

arXiv:2412.15220 · cs.MM · December 23, 2024

Open Access

TL;DR

SyncFlow is a novel system that generates synchronized audio and video from text using a dual-diffusion-transformer architecture, achieving better correlation, quality, and zero-shot adaptability compared to existing methods.

Contribution

The paper introduces SyncFlow, a joint audio-video generation model with a dual-diffusion-transformer architecture and a multi-stage training strategy for improved synchronization and quality.

Findings

01. Produces audio and video outputs that are more correlated than those of baseline methods

02. Significantly improves audio quality and audio-visual correspondence

03. Demonstrates strong zero-shot generation and adaptation to new video resolutions

Abstract

Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies