One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning

Suzhen Wang; Lincheng Li; Yu Ding; Xin Yu

arXiv:2112.02749 · cs.CV · December 7, 2021

TL;DR

This paper introduces a novel framework for one-shot talking face generation that leverages consistent speaker-specific audio-visual correlations, enabling more authentic and synchronized mouth movements in generated videos.

Contribution

The paper proposes an Audio-Visual Correlation Transformer (AVCT) that generalizes across speakers, together with a motion field transfer module, to improve lip-sync and visual quality.

Findings

01 Outperforms state-of-the-art methods in visual quality

02 Produces more authentic mouth movements

03 Achieves better lip-sync accuracy

Abstract

Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their created videos often suffer from unnatural mouth shapes and asynchronous lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions, represented by keypoint-based dense motion fields, from an input audio. In particular, considering audio may come from…
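
As a rough illustration of the attention operation that a transformer such as AVCT builds on, the sketch below implements plain scaled dot-product attention using only the Python standard library. It is not the authors' implementation: the names `attend`, `keypoint_queries`, `audio_keys`, and `audio_values`, the toy numbers, and the idea of letting keypoint queries attend over audio frames are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy data: 2 keypoint queries attending over 3 audio frames.
keypoint_queries = [[1.0, 0.0], [0.0, 1.0]]
audio_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_values = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
motion_features = attend(keypoint_queries, audio_keys, audio_values)
```

Because the attention weights for each query are non-negative and sum to one, every row of `motion_features` is a weighted average of the audio value vectors. A real model would add learned projections, multiple heads, and positional encodings on top of this core step.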

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Softmax · Residual Connection · Adam · Dropout · Position-Wise Feed-Forward Layer · Layer Normalization · Dense Connections