Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation
Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazım Kemal Ekenel, Alexander Waibel

TL;DR
This paper introduces a mask-free method for audio-driven talking face generation that improves visual quality and identity preservation. Instead of relying on masked inputs or identity reference images, it first transforms the input face to a closed-mouth state and then adapts the lips to the audio.
Contribution
The proposed approach eliminates the need for masked input images and identity references, improving identity preservation and visual quality in talking face generation.
Findings
Outperforms state-of-the-art methods on LRS2 and HDTF datasets.
Maintains high visual quality and accurate lip synchronization.
Reduces information loss and identity mismatch issues.
Abstract
Audio-Driven Talking Face Generation aims at generating realistic videos of talking faces, focusing on accurate audio-lip synchronization without deteriorating any identity-related visual details. Recent state-of-the-art methods are based on inpainting, meaning that the lower half of the input face is masked, and the model fills the masked region by generating lips aligned with the given audio. Hence, to preserve identity-related visual details from the lower half, these approaches additionally require an unmasked identity reference image randomly selected from the same video. However, this common masking strategy suffers from (1) information loss in the input faces, significantly affecting the networks' ability to preserve visual quality and identity details, (2) variation between identity reference and input image degrading reconstruction performance, and (3) the identity reference…
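The conventional inpainting-style input described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy frame layout (rows of RGB tuples), the channel-wise stacking of masked frame and reference, and the choice to draw the reference from frames other than the target are all assumptions made for illustration.

```python
import random

def mask_lower_half(face):
    """Return a copy of `face` (rows of RGB tuples) with the lower half zeroed."""
    height = len(face)
    width = len(face[0])
    upper = [list(row) for row in face[: height // 2]]
    lower = [[(0, 0, 0)] * width for _ in range(height - height // 2)]
    return upper + lower

def build_inpainting_input(frames, target_idx, rng):
    """Pair the masked target frame with a random reference from the same video."""
    # Assumption: the reference is drawn uniformly from the other frames.
    candidates = [i for i in range(len(frames)) if i != target_idx]
    ref_idx = rng.choice(candidates)
    masked = mask_lower_half(frames[target_idx])
    reference = frames[ref_idx]
    # Assumption: channel-wise stacking, giving 6 values per pixel.
    stacked = [
        [m_px + r_px for m_px, r_px in zip(m_row, r_row)]
        for m_row, r_row in zip(masked, reference)
    ]
    return stacked, ref_idx

# Toy "video": 3 frames of 4x4 pixels, each frame a constant colour.
frames = [[[(v, v, v)] * 4 for _ in range(4)] for v in (10, 20, 30)]
stacked, ref_idx = build_inpainting_input(frames, target_idx=0, rng=random.Random(0))
```

In this sketch the lower rows of the target frame are replaced with zeros, so the original mouth region never reaches the model; lower-face detail can only come from the reference frame, which may differ from the target. That corresponds to the information loss and reference mismatch the abstract lists as drawbacks of masking.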
