SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Kaidi Wang; Yi He; Wenhao Guan; Weijie Wu; Hongwu Ding; Xiong Zhang; Di Wu; Meng Meng; Jian Luan; Lin Li; Qingyang Hong

arXiv:2512.05126·eess.AS·December 8, 2025

TL;DR

SyncVoice is a video dubbing framework that fine-tunes a pretrained TTS model on audio-visual data to improve speech naturalness and audio-visual synchronization, and extends dubbing to cross-lingual settings.

Contribution

It introduces a vision-augmented video dubbing framework built on a pretrained TTS model, adds a Dual Speaker Encoder to mitigate inter-language interference in cross-lingual synthesis, and reports high-fidelity, well-synchronized dubbing in experiments.

Findings

01. Achieves high-fidelity speech with strong audiovisual synchronization

02. Effectively mitigates inter-language interference in cross-lingual synthesis

03. Demonstrates potential in video translation scenarios

Abstract

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Subtitles and Audiovisual Media