Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, Yong Man Ro

TL;DR
This paper introduces VCA-GAN, a lip-to-speech synthesis model that integrates local lip movements and global visual context using attention mechanisms and contrastive learning, improving synthesis quality, especially in multi-speaker scenarios.
Contribution
The paper presents a new GAN architecture with visual context attention and synchronization learning for improved lip-to-speech synthesis.
Findings
VCA-GAN outperforms existing methods in speech synthesis quality.
It effectively models both local lip movements and global visual context.
It synthesizes speech for multiple speakers with high accuracy.
Abstract
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophene. To achieve this, a visual context attention module is proposed where it encodes global representations from the local visual features, and provides the desired global visual context corresponding to the given coarse speech representation to the generator through audio-visual attention. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced…
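The audio-visual attention step described in the abstract can be illustrated with a generic scaled dot-product cross-attention, where each coarse speech feature acts as a query over a sequence of global visual context vectors. This is only a minimal sketch under stated assumptions, not the authors' module: the function name `audio_visual_attention`, the toy feature vectors, and the absence of learned projection layers are all assumptions made for illustration, and the abstract excerpt does not specify the actual layer design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_visual_attention(speech_feats, visual_context):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    speech_feats:   list of query vectors, one per coarse speech frame
    visual_context: list of vectors used as both keys and values
    Returns (attended, weights): one context vector and one weight row
    per speech frame. No learned projections are applied (an assumption).
    """
    d = len(visual_context[0])
    attended, weights = [], []
    for q in speech_feats:
        # Similarity between this speech frame and every visual step.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in visual_context
        ]
        w = softmax(scores)
        # Weighted sum of the visual context vectors.
        ctx = [
            sum(w[j] * visual_context[j][c] for j in range(len(visual_context)))
            for c in range(d)
        ]
        weights.append(w)
        attended.append(ctx)
    return attended, weights


# Toy inputs: three speech frames, two visual context steps, dimension 2.
speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0]]
attended, weights = audio_visual_attention(speech, visual)
```

In this toy run, the first speech frame places more weight on the first visual step, and the third frame, which is equally similar to both steps, splits its weight evenly between them.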
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis
Methods: Contrastive Learning
