JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation

Sai Tanmay Reddy Chakkera; Aggelina Chatziagapi; Dimitris Samaras

arXiv:2409.12156·cs.CV·September 19, 2024

TL;DR

JEAN is a novel NeRF-based method for generating realistic talking face videos that accurately synchronize lip movements with audio while preserving facial expressions and identity, trained on monocular videos without ground truth.

Contribution

It introduces a joint framework that disentangles audio and expression features using self-supervised and contrastive learning, enabling high-fidelity, synchronized talking face synthesis.

Findings

01 Achieves state-of-the-art lip synchronization accuracy

02 Demonstrates high-quality facial expression transfer

03 Operates effectively without ground truth data

Abstract

We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements.…
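The abstract does not say which contrastive objective is used to align audio features with lip motion. As a rough illustration only, the sketch below assumes an InfoNCE-style loss, in which the audio feature of a frame is pulled toward the lip-motion feature of the same frame and pushed away from the lip-motion features of other frames. The function names, the temperature value, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce(audio_feats, lip_feats, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not the paper's).

    Row i of audio_feats is treated as the positive match for row i of
    lip_feats; every other row of lip_feats serves as a negative.
    """
    a = [l2_normalize(v) for v in audio_feats]
    m = [l2_normalize(v) for v in lip_feats]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity between audio frame i and every lip frame.
        logits = [sum(x * y for x, y in zip(ai, mj)) / temperature for mj in m]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching frame as the target class.
        total += log_denom - logits[i]
    return total / len(a)

# Toy features: two frames with 2-D audio and lip-motion embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(audio, [[0.9, 0.1], [0.1, 0.9]])
swapped = info_nce(audio, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is lower when each audio feature points in the same direction as its own frame's lip feature (`aligned`) than when the pairing is swapped (`swapped`), which is the behavior a contrastive alignment term is meant to reward.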

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing

Methods: Contrastive Learning