SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Haohe Liu, Gael Le Lan, Xinhao Mei, Zhaoheng Ni, Anurag Kumar, Varun Nagaraja, Wenwu Wang, Mark D. Plumbley, Yangyang Shi, Vikas Chandra

TL;DR
SyncFlow generates temporally synchronized audio and video from text using a dual-diffusion-transformer (d-DiT) architecture, and reports stronger audio-visual correlation, higher quality, and better zero-shot adaptability than existing methods.
Contribution
The paper introduces SyncFlow, a joint audio-video generation model that combines a dual-diffusion-transformer (d-DiT) architecture with a multi-stage training strategy to improve synchronization and quality while managing the computational cost of joint modelling.
Findings
Produces more correlated audio-visual outputs than baselines
Enhances audio quality and correspondence significantly
Demonstrates strong zero-shot generation and resolution adaptation
Abstract
Video and audio are closely correlated modalities that humans naturally perceive together. While recent advancements have enabled the generation of audio or video from text, producing both modalities simultaneously still typically relies on either a cascaded process or multi-modal contrastive encoders. These approaches, however, often lead to suboptimal results due to inherent information losses during inference and conditioning. In this paper, we introduce SyncFlow, a system that is capable of simultaneously generating temporally synchronized audio and video from text. The core of SyncFlow is the proposed dual-diffusion-transformer (d-DiT) architecture, which enables joint video and audio modelling with proper information fusion. To efficiently manage the computational cost of joint audio and video modelling, SyncFlow utilizes a multi-stage training strategy that separates video and…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies
