SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

TL;DR
SyncVoice is a video dubbing framework that fine-tunes a pretrained TTS model on audio-visual data, improving speech naturalness and audio-visual synchronization and extending dubbing to cross-lingual settings.
Contribution
It introduces a vision-augmented TTS framework with a Dual Speaker Encoder for cross-lingual dubbing and demonstrates its effectiveness in high-fidelity, synchronized video dubbing.
Findings
Achieves high-fidelity speech with strong audiovisual synchronization
Effectively mitigates inter-language interference in cross-lingual synthesis
Demonstrates potential in video translation scenarios
Abstract
Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media
