Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework
Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel

TL;DR
This paper presents a comprehensive evaluation framework for detecting and quantifying lip leakage in talking face generation models, addressing limitations of standard metrics and test setups.
Contribution
It introduces a systematic, model-agnostic evaluation methodology with new metrics and test setups to better assess identity leakage in talking face synthesis.
Findings
The framework effectively detects lip leakage across different models.
Derived metrics provide quantitative measures of lip-sync fidelity.
The study shows how the choice of identity reference affects leakage.
Abstract
Video editing-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage,…
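To make the three test setups named in the abstract concrete, the following minimal sketch enumerates one test case per setup for each source clip. It is an illustration, not the authors' code: the names `Clip`, `TestCase`, and `build_test_cases` are hypothetical, silent input is represented simply as `None`, and the rule for choosing mismatched audio (borrowing the next clip's track) is an assumption, since the abstract does not specify how pairs are selected. The derived metrics are not sketched because their definitions are not given here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Clip:
    """Hypothetical container for a source video and its own audio track."""
    video: str
    audio: str


@dataclass(frozen=True)
class TestCase:
    """One generation request: which setup, which video, which driving audio."""
    setup: str            # "silent", "matched", or "mismatched"
    video: str
    audio: Optional[str]  # None stands in for a silent audio input (assumption)


def build_test_cases(clips: List[Clip]) -> List[TestCase]:
    """Enumerate silent, matched, and mismatched cases for every clip."""
    cases: List[TestCase] = []
    n = len(clips)
    for i, clip in enumerate(clips):
        # Silent-input generation: the video is driven by silence.
        cases.append(TestCase("silent", clip.video, None))
        # Matched synthesis: the video is driven by its own audio.
        cases.append(TestCase("matched", clip.video, clip.audio))
        # Mismatched pairing: the video is driven by another clip's audio.
        # Taking the next clip cyclically is an assumed pairing rule.
        if n > 1:
            other = clips[(i + 1) % n]
            cases.append(TestCase("mismatched", clip.video, other.audio))
    return cases


if __name__ == "__main__":
    demo = [Clip("a.mp4", "a.wav"), Clip("b.mp4", "b.wav")]
    for case in build_test_cases(demo):
        print(case)
```

Each resulting case would be passed to the talking face model under evaluation, and the generated videos scored per setup so that results under silent, matched, and mismatched conditions can be compared.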
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
