Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Yasheng Sun; Hang Zhou; Kaisiyuan Wang; Qianyi Wu; Zhibin Hong; Jingtuo Liu; Errui Ding; Jingdong Wang; Ziwei Liu; Hideki Koike

arXiv:2212.04970·cs.CV·December 12, 2022

TL;DR

This paper introduces AV-CAT, a transformer-based framework that accurately predicts and inpaints mouth shapes for lip-syncing, leveraging audio-visual context to produce realistic talking face videos with minimal facial deformation.

Contribution

The work presents a novel hybrid convolution-Transformer model with an attention-based fusion strategy specifically designed for mouth inpainting in lip-sync generation, focusing on realistic and targeted facial region editing.
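As a rough illustration of what an attention-based fusion step can look like, the sketch below implements generic scaled dot-product cross-attention in plain Python, with visual tokens as queries and audio tokens as keys and values. This is an assumption made for illustration only: the names `cross_attention`, `visual_tokens`, and `audio_tokens` are hypothetical, and the paper's actual fusion design is not specified in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of query vectors (here: hypothetical visual tokens)
    keys, values: lists of vectors (here: hypothetical audio tokens)
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


# Toy inputs: one visual token attending over two identical audio keys,
# so both audio values receive equal weight.
visual_tokens = [[1.0, 0.0]]
audio_tokens = [[1.0, 0.0], [1.0, 0.0]]
audio_values = [[0.0, 2.0], [4.0, 6.0]]
fused_tokens = cross_attention(visual_tokens, audio_tokens, audio_values)
```

In a real model the queries, keys, and values would come from learned projections and multiple heads; those are omitted here to keep the mechanism visible.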

Findings

01. Achieves high-fidelity lip-synced results for arbitrary subjects.

02. Outperforms existing methods in realism and accuracy.

03. Demonstrates effective use of audio-visual context in lip movement prediction.

Abstract

Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to exploit desired contextual information provided in audio and visual modalities thoroughly with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based…
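The masking step mentioned in the abstract can be pictured with a minimal sketch. It assumes, purely for illustration, that the masked area is the bottom portion of a face crop and that frames are small grayscale grids; the function name `mask_lower_face`, the `mask_ratio` parameter, and the choice of region are hypothetical and are not taken from the paper.

```python
def mask_lower_face(frame, mask_ratio=0.5, fill=0):
    """Blank out the bottom `mask_ratio` of a face crop (illustrative only).

    frame: list of rows, each a list of pixel values (grayscale for brevity).
    Returns (masked_frame, mask), where mask[r][c] == 1 marks a pixel
    that an inpainting model would have to predict.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    height = len(frame)
    first_masked_row = height - int(round(height * mask_ratio))
    masked_frame, mask = [], []
    for r, row in enumerate(frame):
        hide = r >= first_masked_row
        masked_frame.append([fill if hide else px for px in row])
        mask.append([1 if hide else 0 for _ in row])
    return masked_frame, mask


# Toy 4x2 "face crop": the bottom two rows are hidden.
frame = [[10, 11], [20, 21], [30, 31], [40, 41]]
masked_frame, mask = mask_lower_face(frame, mask_ratio=0.5)
```

The model would then receive the masked frame together with audio features and reference frames, and be asked to fill in only the pixels where the mask is 1.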

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Dropout · Byte Pair Encoding · Inpainting · Linear Layer · Dense Connections · Residual Connection · Adam