From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping
Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu

TL;DR
This paper introduces X-Dub, a two-stage generative framework that uses diffusion transformers to enable mask-free visual dubbing, improving robustness, lip sync accuracy, and visual quality by generating high-fidelity pseudo data for training.
Contribution
The paper proposes a novel mask-free visual dubbing method using generative bootstrapping with diffusion transformers, eliminating artifacts caused by masking and enhancing robustness and quality.
Findings
Achieves state-of-the-art lip sync accuracy.
Demonstrates superior robustness to occlusions.
Provides a new benchmark for diverse dubbing scenarios.
Abstract
Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech but is fundamentally challenged by the lack of ideal training data: paired videos differing only in lip motion. Existing methods circumvent this via mask-based inpainting. However, masking inevitably destroys spatiotemporal context, leading to identity drift and poor robustness (e.g., to occlusions), while also inducing lip-shape leakage that degrades lip sync. To bridge this gap, we propose X-Dub, a novel two-stage generative bootstrapping framework leveraging powerful Diffusion Transformers to unlock mask-free dubbing. Our core insight is to repurpose a mask-based inpainting model exclusively as a dedicated data generator to synthesize scalable, high-fidelity pseudo-paired data, which is subsequently utilized to train and bootstrap a robust, mask-free editing model as the final video dubber. The…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
