Lipschitz-Driven Noise Robustness in VQ-AE for High-Frequency Texture Repair in ID-Specific Talking Heads
Jian Yang, Xukun Wang, Wentao Wang, Guoming Li, Qihang Fang, Ruihong Yuan, Tianyang Wang, Xiaomei Zhang, Yeying Jin, Zhaoxin Fan

TL;DR
This paper introduces a Lipschitz-based theoretical framework for enhancing noise robustness in Vector Quantized AutoEncoders, enabling high-quality, identity-specific talking head generation with improved high-frequency texture detail and real-time efficiency.
Contribution
It derives a theoretical noise robustness upper bound for VQ-AE using Lipschitz continuity, and proposes a plug-and-play SOVQAE for improved denoising in ID-specific talking head generation.
Findings
Achieves state-of-the-art video quality and lip sync robustness.
Runs in real-time on consumer GPUs.
Requires only minimal additional training or resources.
Abstract
Audio-driven IDentity-specific Talking Head Generation (ID-specific THG) has shown increasing promise for applications in filmmaking and virtual reality. Existing approaches are generally constructed as end-to-end paradigms, and have achieved significant progress. However, they often struggle to capture high-frequency textures due to limited model capacity. To address these limitations, we adopt a simple yet efficient post-processing framework -- unlike previous studies that focus solely on end-to-end training -- guided by our theoretical insights. Specifically, leveraging the Lipschitz Continuity Theory of neural networks, we prove a crucial noise tolerance property for the Vector Quantized AutoEncoder (VQ-AE), and establish the existence of a Noise Robustness Upper Bound (NRoUB). This insight reveals that we can efficiently obtain an identity-specific denoiser by training an…
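The noise tolerance property mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's NRoUB derivation; it only shows the general reason a vector-quantized bottleneck absorbs small perturbations. It assumes nearest-neighbor quantization under Euclidean distance and an encoder with a known Lipschitz constant; the function names (`quantize`, `latent_tolerance`, `input_noise_bound`) and the numeric values are hypothetical, chosen for illustration.

```python
import math

def quantize(z, codebook):
    """Return the index of the codebook entry nearest to z (Euclidean)."""
    return min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))

def latent_tolerance(z, codebook):
    """Perturbation radius in latent space that cannot change the chosen code.

    With d1 <= d2 the distances from z to its nearest and second-nearest
    entries, the triangle inequality gives: any perturbation of norm
    strictly below (d2 - d1) / 2 leaves the nearest entry unchanged.
    """
    dists = sorted(math.dist(z, c) for c in codebook)
    return (dists[1] - dists[0]) / 2

def input_noise_bound(z, codebook, lipschitz_const):
    """Input-space noise radius tolerated by an L-Lipschitz encoder.

    Assumption: the encoder E satisfies ||E(x + e) - E(x)|| <= L * ||e||,
    so input noise of norm below latent_tolerance / L maps to a latent
    perturbation that stays inside the tolerance radius.
    """
    return latent_tolerance(z, codebook) / lipschitz_const

# Hypothetical two-entry codebook and a clean latent vector.
codebook = [(0.0, 0.0), (4.0, 0.0)]
z_clean = (1.0, 0.0)
z_noisy = (1.9, 0.0)  # perturbation of norm 0.9, inside the radius

same_code = quantize(z_clean, codebook) == quantize(z_noisy, codebook)
print(same_code, latent_tolerance(z_clean, codebook))
```

In this toy setting, a smaller Lipschitz constant for the encoder enlarges the tolerated input noise, which is the intuition behind using Lipschitz continuity to reason about robustness.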
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Advanced Data Compression Techniques
