Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
Lele Chen, Ross K. Maddox, Zhiyao Duan, Chenliang Xu

TL;DR
This paper introduces a cascade GAN framework for talking face video generation that improves robustness and visual quality by transferring audio to facial landmarks before video synthesis, and employs novel loss and discriminator designs.
Contribution
It proposes a hierarchical approach with a dynamic pixel-wise loss and a sequence-aware discriminator to enhance synchronization and image sharpness in talking face videos.
Findings
Outperforms state-of-the-art methods in quantitative metrics
Produces more realistic and synchronized talking face videos
Demonstrates robustness across various face shapes and audio conditions
Abstract
We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual signals that are irrelevant to the speech content. We, humans, are sensitive to temporal discontinuities and subtle artifacts in video. To avoid those pixel jittering problems and to enforce the network to focus on audiovisual-correlated regions, we propose a novel dynamically adjustable pixel-wise loss with an attention mechanism. Furthermore, to generate a sharper image with…
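The abstract describes a pixel-wise loss whose per-pixel contribution is adjusted by an attention mechanism. The paper's exact formulation is not reproduced in this summary, so the following is only a minimal sketch of the general idea, assuming a plain L1 reconstruction error scaled by a precomputed attention map. The function name `attention_weighted_l1`, the nested-list frame representation, and the toy values are all hypothetical and chosen for illustration.

```python
from typing import List

# Assumption: one grayscale frame is a list of rows of pixel intensities.
Frame = List[List[float]]


def attention_weighted_l1(pred: Frame, target: Frame, attention: Frame) -> float:
    """Mean absolute pixel error, with each pixel scaled by an attention weight.

    Illustrative only: this is a generic attention-weighted L1 loss,
    not the loss defined in the paper.
    """
    total = 0.0
    count = 0
    for pred_row, target_row, attn_row in zip(pred, target, attention):
        for p, t, a in zip(pred_row, target_row, attn_row):
            # Pixels with higher attention contribute more to the loss.
            total += a * abs(p - t)
            count += 1
    return total / count if count else 0.0


# Toy 2x2 example: attention is zero on the left column, so only
# errors in the right column are penalized.
pred = [[0.0, 0.5], [1.0, 1.0]]
target = [[0.0, 0.0], [1.0, 0.0]]
attention = [[0.0, 1.0], [0.0, 0.5]]
loss = attention_weighted_l1(pred, target, attention)
```

In this sketch the attention map is a fixed input. In a trained model it would presumably be produced by the network itself, which is what would make the weighting dynamic.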
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
Methods: Convolution
