Lip to Speech Synthesis with Visual Context Attentional GAN

Minsu Kim; Joanna Hong; Yong Man Ro

arXiv:2204.01726·cs.CV·April 6, 2022·24 cites

TL;DR

This paper introduces VCA-GAN, a lip-to-speech synthesis model that combines local lip movements with global visual context using attention mechanisms and contrastive learning. The authors report improved synthesis quality, especially in multi-speaker scenarios.

Contribution

The paper presents a new GAN architecture with visual context attention and synchronization learning for improved lip-to-speech synthesis.

Findings

01 VCA-GAN outperforms existing methods in speech synthesis quality.

02 Effective modeling of local lip movements and global visual context.

03 Successful multi-speaker speech synthesis with high accuracy.

Abstract

In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophene. To achieve this, a visual context attention module is proposed where it encodes global representations from the local visual features, and provides the desired global visual context corresponding to the given coarse speech representation to the generator through audio-visual attention. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced…
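The audio-visual attention described in the abstract can be illustrated with a minimal sketch, in which each frame of a coarse speech representation queries the global visual context. This is not the authors' implementation: the function name `audio_visual_attention`, the use of plain scaled dot-product attention, and the choice to use the visual context as both keys and values without learned projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_visual_attention(speech_queries, visual_context):
    """Hypothetical sketch: scaled dot-product attention.

    speech_queries: list of T_s vectors (coarse speech features, dim d)
    visual_context: list of T_v vectors (global visual features, dim d),
                    used here as both keys and values (an assumption)
    Returns (attended, weights): one context vector and one weight row
    per speech frame.
    """
    d = len(visual_context[0])
    attended, weights = [], []
    for q in speech_queries:
        # Similarity of this speech frame to every visual-context frame.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in visual_context
        ]
        w = softmax(scores)
        # Weighted sum of visual-context vectors.
        ctx = [
            sum(w[t] * visual_context[t][j] for t in range(len(visual_context)))
            for j in range(d)
        ]
        attended.append(ctx)
        weights.append(w)
    return attended, weights


# Toy inputs: 2 speech frames, 3 visual-context frames, feature dim 2.
speech = [[1.0, 0.0], [0.0, 1.0]]
visual = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = audio_visual_attention(speech, visual)
```

In this toy case the first speech frame places most of its weight on the first visual frame and the second speech frame on the second, which is the sense in which the attention selects the visual context corresponding to a given speech representation. A trained model would typically add learned projections for queries, keys, and values.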

Code & Models

Repositories

ms-dot-k/Visual-Context-Attentional-GAN (PyTorch, official)

Videos

Lip to Speech Synthesis with Visual Context Attentional GAN · SlidesLive

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis

Methods: Contrastive Learning