Lipschitz-Driven Noise Robustness in VQ-AE for High-Frequency Texture Repair in ID-Specific Talking Heads

Jian Yang; Xukun Wang; Wentao Wang; Guoming Li; Qihang Fang; Ruihong Yuan; Tianyang Wang; Xiaomei Zhang; Yeying Jin; Zhaoxin Fan

arXiv:2410.00990·cs.CV·June 10, 2025

TL;DR

This paper introduces a Lipschitz-based theoretical framework for enhancing noise robustness in Vector Quantized AutoEncoders, enabling high-quality, identity-specific talking head generation with improved high-frequency texture detail and real-time efficiency.

Contribution

It develops a theoretical noise robustness bound for VQ-AE guided by Lipschitz continuity, and proposes a plug-and-play SOVQAE for improved denoising in ID-specific talking head generation.

Findings

1. Achieves state-of-the-art video quality and lip sync robustness.
2. Runs in real time on consumer GPUs.
3. Requires only minimal additional training or resources.

Abstract

Audio-driven IDentity-specific Talking Head Generation (ID-specific THG) has shown increasing promise for applications in filmmaking and virtual reality. Existing approaches are generally constructed as end-to-end paradigms and have achieved significant progress. However, they often struggle to capture high-frequency textures due to limited model capacity. To address this limitation, and unlike previous studies that focus solely on end-to-end training, we adopt a simple yet efficient post-processing framework guided by our theoretical insights. Specifically, leveraging the Lipschitz Continuity Theory of neural networks, we prove a crucial noise tolerance property for the Vector Quantized AutoEncoder (VQ-AE), and establish the existence of a Noise Robustness Upper Bound (NRoUB). This insight reveals that we can efficiently obtain an identity-specific denoiser by training an…
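The noise tolerance idea in the abstract can be illustrated with a generic geometric argument. The sketch below is not the paper's NRoUB derivation or its model: it assumes a toy linear encoder `W` (whose Lipschitz constant is its spectral norm) and a random codebook, and all function names are hypothetical. A latent vector keeps its nearest codebook entry whenever it moves by less than half the gap between its nearest and second-nearest code distances; dividing that margin by the encoder's Lipschitz constant gives a sufficient bound on input noise.

```python
import numpy as np


def nearest_code(z, codebook):
    """Index of the codebook row closest to latent z (Euclidean distance)."""
    dists = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(dists))


def latent_margin(z, codebook):
    """Half the gap between the two smallest code distances.

    Any latent perturbation with norm below this value cannot change
    which codebook entry is nearest (triangle inequality).
    """
    dists = np.sort(np.linalg.norm(codebook - z, axis=1))
    return 0.5 * (dists[1] - dists[0])


def input_noise_bound(x, W, codebook):
    """Sufficient input-noise radius for the toy linear encoder x -> W @ x.

    Assumption: the encoder is linear, so its Lipschitz constant is the
    spectral norm of W. A real VQ-AE encoder would need its own estimate.
    """
    lipschitz = np.linalg.norm(W, 2)
    return latent_margin(W @ x, codebook) / lipschitz


# Toy setup (all sizes and values are illustrative, not from the paper).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # hypothetical linear encoder
codebook = rng.normal(size=(16, 4))  # hypothetical codebook of 16 entries
x = rng.normal(size=8)               # clean input

bound = input_noise_bound(x, W, codebook)
clean_index = nearest_code(W @ x, codebook)

# Perturb the input with noise just inside the bound.
noise = rng.normal(size=8)
noise *= 0.99 * bound / np.linalg.norm(noise)
noisy_index = nearest_code(W @ (x + noise), codebook)

print("same code selected:", noisy_index == clean_index)
```

The bound here is sufficient but not necessary: noise larger than `bound` may still map to the same code, depending on its direction.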

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Data Compression Techniques