From Inpainting to Editing: Unlocking Robust Mask-Free Visual Dubbing via Generative Bootstrapping

Xu He; Haoxian Zhang; Hejia Chen; Changyuan Zheng; Liyang Chen; Songlin Tang; Jiehui Huang; Xiaoqiang Liu; Pengfei Wan; Zhiyong Wu

arXiv:2512.25066·cs.CV·March 25, 2026



TL;DR

This paper introduces X-Dub, a two-stage generative bootstrapping framework built on Diffusion Transformers for mask-free visual dubbing. A mask-based inpainting model first synthesizes high-fidelity pseudo-paired data, which is then used to train a mask-free editing model, with the aim of improving robustness, lip-sync accuracy, and visual quality.

Contribution

The paper proposes a mask-free visual dubbing method based on generative bootstrapping with Diffusion Transformers, avoiding the masking-induced problems of identity drift, poor robustness to occlusions, and lip-shape leakage.

Findings

01 Achieves state-of-the-art lip-sync accuracy.

02 Demonstrates superior robustness to occlusions.

03 Provides a new benchmark for diverse dubbing scenarios.

Abstract

Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech but is fundamentally challenged by the lack of ideal training data: paired videos differing only in lip motion. Existing methods circumvent this via mask-based inpainting. However, masking inevitably destroys spatiotemporal context, leading to identity drift and poor robustness (e.g., to occlusions), while also inducing lip-shape leakage that degrades lip sync. To bridge this gap, we propose X-Dub, a novel two-stage generative bootstrapping framework leveraging powerful Diffusion Transformers to unlock mask-free dubbing. Our core insight is to repurpose a mask-based inpainting model exclusively as a dedicated data generator to synthesize scalable, high-fidelity pseudo-paired data, which is subsequently utilized to train and bootstrap a robust, mask-free editing model as the final video dubber. The…


Code & Models

Models

🤗 KlingTeam/X-Dub (model, ♡ 17)


Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis