Towards Realistic Visual Dubbing with Heterogeneous Sources
Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma

TL;DR
This paper introduces a two-stage framework for realistic visual dubbing that effectively leverages heterogeneous data sources, improving lip synchronization and speaker identity preservation in talking head videos.
Contribution
The proposed method employs facial landmarks as intermediate priors and disentangles lip movement prediction from head generation, enabling flexible use of diverse data and better personalization.
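The two-stage split described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear audio-to-landmark map, the point-drawing "renderer", the 68-point landmark layout with lip indices 48 to 67, and every function name below are assumptions made for illustration. The sketch only shows the data flow: stage 1 predicts lip landmarks from audio features, and stage 2 generates frames from landmarks plus a reference image. Because the two stages share only the landmark representation, each could in principle be trained on a different kind of data.

```python
import numpy as np

# Assumed layout: 68 facial landmarks, lips at indices 48..67.
# This follows the common 68-point convention, not the paper.
N_LANDMARKS = 68
LIP_IDX = np.arange(48, 68)


def predict_lip_landmarks(audio_feats, weights):
    """Stage 1 stand-in: map per-frame audio features to lip landmarks.

    audio_feats: (T, D) array; weights: (D, len(LIP_IDX) * 2) array.
    A real system would use a learned sequence model here.
    """
    flat = audio_feats @ weights
    return flat.reshape(len(audio_feats), len(LIP_IDX), 2)


def merge_landmarks(video_landmarks, lip_landmarks):
    """Keep the target video's head landmarks, swap in the predicted lips."""
    merged = video_landmarks.copy()
    merged[:, LIP_IDX] = lip_landmarks
    return merged


def render_frames(landmarks, reference_frame):
    """Stage 2 stand-in: draw landmarks onto copies of a reference frame.

    A real system would use a generator network conditioned on the
    landmarks and reference images of the speaker.
    """
    h, w = reference_frame.shape
    frames = np.repeat(reference_frame[None], len(landmarks), axis=0)
    # Round to pixel coordinates and clamp x to [0, w-1], y to [0, h-1].
    xy = np.clip(np.rint(landmarks).astype(int), 0, [w - 1, h - 1])
    for t in range(len(frames)):
        frames[t, xy[t, :, 1], xy[t, :, 0]] = 255
    return frames


# Toy inputs: 4 frames, 8-dimensional audio features, 64x64 grayscale image.
rng = np.random.default_rng(0)
T, D, SIZE = 4, 8, 64
audio_feats = rng.normal(size=(T, D))
weights = rng.normal(size=(D, len(LIP_IDX) * 2))
video_landmarks = rng.uniform(0, SIZE, size=(T, N_LANDMARKS, 2))
reference_frame = np.zeros((SIZE, SIZE), dtype=np.uint8)

lips = predict_lip_landmarks(audio_feats, weights)
merged = merge_landmarks(video_landmarks, lips)
frames = render_frames(merged, reference_frame)
```

In this sketch, stage 1 never touches pixels and stage 2 never touches audio. That is one way to read the abstract's examples of blurry-picture and corrupted-audio videos as usable training data.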
Findings
Outperforms state-of-the-art methods in realism and lip synchronization.
Supports fine-tuning for individual speakers.
Effectively utilizes heterogeneous data sources.
Abstract
The task of few-shot visual dubbing focuses on synchronizing lip movements with arbitrary speech input for any talking head video. Despite moderate improvements, current approaches commonly require high-quality homologous video and audio sources and therefore fail to make sufficient use of heterogeneous data. In practice, collecting perfectly homologous data can be intractable, for example when videos have corrupted audio or blurry pictures. To exploit this kind of data and support high-fidelity few-shot visual dubbing, we propose a simple yet efficient two-stage framework that offers greater flexibility in mining heterogeneous data. Specifically, our two-stage paradigm employs facial landmarks as an intermediate prior of latent representations and disentangles lip movement prediction from the core task of realistic talking head…
