Audio-Sync Video Generation with Multi-Stream Temporal Control
Shuchen Weng, Haojie Zheng, Zheng Chang, Si Li, Boxin Shi, Xinlong Wang

TL;DR
The paper introduces MTV, a framework that generates high-quality, synchronized videos from audio by disentangling audio types. It is supported by a new dataset, DEMIX, and reports state-of-the-art results in audio-visual synchronization and quality.
Contribution
MTV is a framework that separates audio into speech, effects, and music tracks for fine-grained control. It is supported by DEMIX, a new dataset covering diverse audio-visual generation scenarios.
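The track-to-control pairing described here can be illustrated with a small sketch. This is not the authors' implementation: the `DemixedAudio` class, its field names, and the `controls` method are hypothetical names chosen for illustration, and only the pairing itself (speech to lip motion, effects to event timing, music to visual mood) comes from the paper's abstract.

```python
from dataclasses import dataclass
from typing import Dict, List

# Pairing taken from the abstract: each demixed track drives one visual aspect.
TRACK_TO_CONTROL: Dict[str, str] = {
    "speech": "lip motion",
    "effects": "event timing",
    "music": "visual mood",
}


@dataclass
class DemixedAudio:
    """Hypothetical container for three time-aligned mono tracks."""

    speech: List[float]
    effects: List[float]
    music: List[float]

    def __post_init__(self) -> None:
        # Assumption: the tracks share one timeline, so they must have equal length.
        lengths = {len(self.speech), len(self.effects), len(self.music)}
        if len(lengths) != 1:
            raise ValueError("all tracks must have the same number of samples")

    def controls(self) -> Dict[str, List[float]]:
        """Return each track keyed by the visual aspect it is meant to control."""
        return {
            control: getattr(self, track)
            for track, control in TRACK_TO_CONTROL.items()
        }


if __name__ == "__main__":
    audio = DemixedAudio(speech=[0.1, 0.2], effects=[0.0, 0.5], music=[0.3, 0.3])
    for control, samples in audio.controls().items():
        print(control, samples)
```

Keeping the three tracks as separate fields, rather than one mixed signal, is what would let a downstream model condition each visual aspect independently.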
Findings
MTV outperforms existing methods on six standard metrics.
Disentangled audio control improves lip motion, event timing, and mood.
DEMIX enables scalable multi-stage training for diverse scenarios.
Abstract
Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX,…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing
