DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

Wenjie Tian; Xinfa Zhu; Haohe Liu; Zhixian Zhao; Zihao Chen; Chaofan Ding; Xinhan Di; Junjie Zheng; Lei Xie

arXiv:2507.10109 · cs.MM · July 15, 2025

Open Access

TL;DR

DualDub jointly generates synchronized background audio and speech from video, filling a gap left by existing video-to-audio models, which largely overlook speech.

Contribution

The paper introduces DualDub, a unified multimodal model for video-to-soundtrack (V2ST) generation, together with a new benchmark, DualBench, and a curriculum learning strategy designed to handle data scarcity.

Findings

01

Achieves state-of-the-art performance in V2ST tasks.

02

Generates high-quality, synchronized background and speech audio.

03

Introduces the first benchmark for V2ST evaluation.

Abstract

While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, our proposed cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. In addition, to handle data scarcity, we design a curriculum learning strategy that progressively builds the multimodal capability. Finally, we introduce DualBench, the first…
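The abstract names causal and non-causal attention in the cross-modal aligner but, as excerpted here, does not say how either is wired up. The difference between the two masking schemes can be illustrated with a minimal, generic sketch of single-head scaled dot-product attention in plain Python. The function names (`attention_mask`, `attend`) and the toy inputs are hypothetical and are not taken from DualDub; the sketch only shows that a causal mask restricts each position to itself and earlier positions, while a non-causal mask lets every position see the whole sequence.

```python
import math

def attention_mask(n_query, n_key, causal):
    """mask[i][j] is True when query position i may attend to key position j."""
    if causal:
        # Causal: position i sees only positions 0..i (no look-ahead).
        return [[j <= i for j in range(n_key)] for i in range(n_query)]
    # Non-causal: every position sees the full sequence.
    return [[True] * n_key for _ in range(n_query)]

def softmax(scores):
    """Numerically stable softmax; masked (-inf) scores get weight 0."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values, causal):
    """Single-head scaled dot-product attention over lists of vectors.

    Hypothetical helper for illustration only; returns (outputs, weights).
    """
    dim = len(keys[0])
    mask = attention_mask(len(queries), len(keys), causal)
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ])
    return outputs, weights

if __name__ == "__main__":
    # Toy stand-ins for three time steps of features (assumed values).
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    _, causal_w = attend(frames, frames, frames, causal=True)
    _, full_w = attend(frames, frames, frames, causal=False)
    for row in causal_w:
        print([round(x, 3) for x in row])
    for row in full_w:
        print([round(x, 3) for x in row])
```

In the causal case the weight matrix is lower-triangular, which suits left-to-right generation; in the non-causal case every row spreads weight over all positions, which gives each step full-sequence context. Which of the two DualDub applies to which modality is not stated in the text above.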

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Music Technology and Sound Studies