Audio-Driven Talking Face Generation with Blink Embedding and Hash Grid Landmarks Encoding
Yuhui Zhang, Hui Yu, Wei Liang, Sunjie Zhang

TL;DR
This paper introduces a novel method for generating high-fidelity talking face videos by combining blink embedding, hash grid landmarks encoding, and neural radiance fields to improve mouth movement accuracy and realism.
Contribution
It presents an automatic approach that integrates facial and audio features using a Dynamic Landmark Transformer to enhance talking face generation quality.
Findings
Outperforms existing methods in fidelity and realism.
Effectively captures mouth movements and facial expressions.
Produces lifelike 3D talking portraits.
Abstract
Dynamic Neural Radiance Fields (NeRF) have demonstrated considerable success in generating high-fidelity 3D models of talking portraits. Despite significant advances in rendering speed and generation quality, challenges persist in capturing mouth movements accurately and efficiently. To tackle this challenge, in this study we propose an automatic method based on blink embedding and hash grid landmarks encoding that can substantially enhance the fidelity of talking faces. Specifically, we encode facial features as conditional features and integrate audio features as residual terms into our model through a Dynamic Landmark Transformer. Furthermore, we employ neural radiance fields to model the entire face, resulting in a lifelike face representation. Experimental evaluations validate the superiority of our approach over existing methods.
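The abstract does not spell out how the landmarks are encoded or how the audio residual is applied, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes an Instant-NGP-style multiresolution hash grid over 3D landmark coordinates normalized to [0, 1], and it stands in for the Dynamic Landmark Transformer with a single linear projection that adds audio features as a residual on top of the landmark (conditional) features. The names `hash_grid_encode`, `fuse`, and `w_audio`, the 68-landmark count, and all sizes are hypothetical.

```python
import numpy as np

# Per-dimension multipliers for the spatial hash (an Instant-NGP-style choice,
# assumed here; the paper's exact hash function is not given in the abstract).
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)


def hash_grid_encode(points, tables, base_res=4, growth=2.0):
    """Encode points in [0, 1]^3 with a multiresolution hash grid.

    points: (N, 3) float array of landmark coordinates.
    tables: list of (T, F) feature tables, one per resolution level.
    Returns an (N, L * F) array of trilinearly interpolated features.
    """
    points = np.asarray(points, dtype=np.float64)
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)      # grid resolution at this level
        scaled = points * res
        lo = np.floor(scaled).astype(np.int64)     # lower corner of the enclosing cell
        frac = scaled - lo                         # position inside the cell
        out = np.zeros((points.shape[0], table.shape[1]))
        for corner in range(8):                    # the 8 corners of the cell
            offset = np.array([(corner >> k) & 1 for k in range(3)])
            idx = (lo + offset).astype(np.uint64)
            # XOR of coordinate * prime, wrapped into the table size.
            h = idx[:, 0] * PRIMES[0] ^ idx[:, 1] * PRIMES[1] ^ idx[:, 2] * PRIMES[2]
            h = h % np.uint64(table.shape[0])
            # Trilinear weight of this corner.
            w = np.prod(np.where(offset == 1, frac, 1.0 - frac), axis=1)
            out += w[:, None] * table[h.astype(np.int64)]
        feats.append(out)
    return np.concatenate(feats, axis=1)


def fuse(landmark_feat, audio_feat, w_audio):
    """Landmark features act as the condition; projected audio is added as a residual.

    This linear projection is a placeholder for the Dynamic Landmark Transformer.
    """
    return landmark_feat + audio_feat @ w_audio


rng = np.random.default_rng(0)
landmarks = rng.random((68, 3))                    # hypothetical 68 landmarks per frame
tables = [rng.normal(scale=1e-2, size=(2 ** 10, 2)) for _ in range(4)]
lm_feat = hash_grid_encode(landmarks, tables)      # shape (68, 8)
audio = rng.normal(size=16)                        # hypothetical per-frame audio feature
w_audio = rng.normal(scale=1e-2, size=(16, 8))
fused = fuse(lm_feat, audio, w_audio)              # shape (68, 8)
```

In a real system the tables and the projection would be learned jointly with the radiance field; here they are random so that the sketch runs on its own. Writing the audio term as an additive residual means that a zero audio projection leaves the landmark features unchanged, which is one plausible reading of "residual terms" in the abstract.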
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
