SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

Zeren Zhang; Haibo Qin; Jiayu Huang; Yixin Li; Hui Lin; Yitao Duan; Jinwen Ma

arXiv:2405.05636·cs.CV·May 10, 2024

TL;DR

SwapTalk is a unified framework that enhances talking face generation by performing face swapping and lip synchronization in a shared latent space, improving video quality, synchronization, and identity consistency.

Contribution

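SwapTalk performs face swapping and lip synchronization in a single shared latent space rather than cascading separate models in RGB space. An identity loss in the face swapping module targets generalization to unseen identities, and expert discriminator supervision is applied within the latent space to improve quality.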

Findings

01 Outperforms existing methods in video quality and synchronization

02 Achieves higher face swapping fidelity and identity consistency

03 Effective on asynchronous audio-video scenarios

Abstract

Combining face swapping with lip synchronization technology offers a cost-effective solution for customized talking face generation. However, directly cascading existing models together tends to introduce significant interference between tasks and reduce video clarity because the interaction space is limited to the low-level semantic RGB space. To address this issue, we propose an innovative unified framework, SwapTalk, which accomplishes both face swapping and lip synchronization tasks in the same latent space. Referring to recent work on face generation, we choose the VQ-embedding space due to its excellent editability and fidelity performance. To enhance the framework's generalization capabilities for unseen identities, we incorporate identity loss during the training of the face swapping module. Additionally, we introduce expert discriminator supervision within the latent space…
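The abstract mentions an identity loss but this excerpt does not give its formula. As a rough illustration only, the sketch below uses one common formulation from the face swapping literature: one minus the cosine similarity between identity embeddings of the source face and the swapped face. The function names, the use of plain Python lists for embeddings, and the choice of cosine distance are all assumptions made for this example, not details confirmed by the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_loss(source_embedding: Sequence[float],
                  swapped_embedding: Sequence[float]) -> float:
    """Hypothetical identity loss: 1 - cosine similarity.

    Assumption: both embeddings come from some pretrained face
    recognition encoder (not specified in this excerpt). The loss is
    0 when the identities align perfectly and grows toward 2 as the
    embeddings point in opposite directions.
    """
    return 1.0 - cosine_similarity(source_embedding, swapped_embedding)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, for illustration only.
    source = [0.2, 0.9, 0.1]
    swapped = [0.25, 0.85, 0.05]
    print(f"identity loss: {identity_loss(source, swapped):.4f}")
```

In a training setup this scalar would typically be computed with a differentiable tensor library and added to the other objectives with a weighting factor; the pure-Python version here only shows the arithmetic.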

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis