Audio-Sync Video Generation with Multi-Stream Temporal Control

Shuchen Weng; Haojie Zheng; Zheng Chang; Si Li; Boxin Shi; Xinlong Wang

arXiv:2506.08003 · cs.CV · June 10, 2025

TL;DR

The paper introduces MTV, a framework that generates high-quality, audio-synchronized video by splitting the input audio into separate speech, effects, and music tracks. A new dataset, DEMIX, supports the framework, and the authors report state-of-the-art results in audio-visual synchronization and video quality.

Contribution

MTV is a novel framework that separates audio into speech, effects, and music for fine-grained control, supported by DEMIX, a new dataset for diverse audio-visual generation scenarios.
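
The stream-to-control mapping described above can be pictured as a small data structure. The sketch below is illustrative only: the names `Stream`, `DemixedAudio`, and `control_plan` are hypothetical and do not come from the MTV code base, and no actual source separation or video model is involved. Only the mapping itself (speech to lip motion, effects to event timing, music to visual mood) is taken from the paper's abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Stream(Enum):
    SPEECH = "speech"
    EFFECTS = "effects"
    MUSIC = "music"


# Stream-to-control mapping as stated in the abstract.
CONTROL_TARGET = {
    Stream.SPEECH: "lip motion",
    Stream.EFFECTS: "event timing",
    Stream.MUSIC: "visual mood",
}


@dataclass
class DemixedAudio:
    """Hypothetical container: one mono waveform (list of floats) per stream."""
    sample_rate: int
    tracks: dict  # maps Stream -> list of float samples

    def duration_seconds(self) -> float:
        # Assumption: demixed tracks are time-aligned and equally long.
        lengths = {len(samples) for samples in self.tracks.values()}
        if len(lengths) != 1:
            raise ValueError("tracks must have equal length")
        return lengths.pop() / self.sample_rate


def control_plan(audio: DemixedAudio) -> dict:
    """Return which visual aspect each available track would condition."""
    return {s.value: CONTROL_TARGET[s] for s in Stream if s in audio.tracks}
```

For example, a clip with only speech and music tracks would yield a plan covering lip motion and visual mood, with no entry for event timing.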

Findings

01. MTV outperforms existing methods on six standard metrics.

02. Disentangled audio control improves lip motion (driven by the speech track), event timing (effects track), and visual mood (music track).

03. DEMIX enables scalable multi-stage training for diverse scenarios.

Abstract

Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively -- resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX,…

Code & Models

Models

BAAI/MTVCraft (Hugging Face model · 17 downloads · 36 likes)

Videos

Audio-Sync Video Generation with Multi-Stream Temporal Control · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing