Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Dogucan Yaman; Fevziye Irem Eyiokur; Leonard Bärmann; Hazim Kemal Ekenel; Alexander Waibel

arXiv:2307.09368·cs.CV·July 19, 2024

TL;DR

This paper proposes a novel approach for talking face generation that improves lip synchronization and visual quality by addressing training instability and lip leaking issues with new loss functions and a silent-lip generator.

Contribution

Introduces stabilized synchronization loss and AVSyncNet, along with a silent-lip generator, to enhance lip sync accuracy and visual quality in talking face videos.

Findings

01 Outperforms state-of-the-art methods in visual quality

02 Achieves better lip synchronization than state-of-the-art methods

03 Validates each contribution through ablation studies

Abstract

Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality, using given audio and reference video while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods. These involve unstable training, lip synchronization, and visual quality issues caused by lip-sync loss, SyncNet, and lip leaking from the identity reference. To address these issues, we first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage. We then introduce stabilized synchronization loss and AVSyncNet to overcome problems caused by lip-sync loss and SyncNet. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Triplet Loss