VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection
Joanna Hong, Minsu Kim, Yong Man Ro

TL;DR
This paper presents VisageSynTalk, a novel approach for silent talking face video-to-speech synthesis that effectively disentangles speech content and speaker identity, enabling high-quality speech generation even for unseen speakers.
Contribution
It introduces speech-visage feature selection and disentanglement techniques to improve unseen speaker video-to-speech synthesis performance.
Findings
Achieves high speech intelligibility for unseen speakers.
Effectively separates speech content from speaker identity.
Validated on multiple datasets showing improved synthesis quality.
Abstract
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis that becomes more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain highly intelligible speech from the model even when the input video is of an unseen subject. To this end, we introduce speech-visage selection that separates the speech content and the speaker identity from the visual features of the input video. The disentangled…
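The separation described in the abstract can be illustrated with a toy example. The sketch below is not the authors' architecture: it assumes a per-channel soft gate (the hypothetical `gate_logits`) that sends part of each visual feature to a time-varying "content" stream and the remainder to a "visage-style" vector, which is averaged over frames on the assumption that identity stays constant throughout a clip. The function name `select_speech_visage` and all parameters are illustrative only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_speech_visage(features, gate_logits):
    """Split per-frame visual features into two complementary parts.

    features:    list of T frames, each a list of D floats.
    gate_logits: list of D floats (hypothetical learned parameters).
    Returns (content, visage_style): content is T x D, visage_style has length D.
    """
    gates = [sigmoid(g) for g in gate_logits]
    # Content stream: the gated share of every channel, kept per frame.
    content = [[g * x for g, x in zip(gates, frame)] for frame in features]
    # Residual stream: the complementary share of every channel.
    residual = [[(1.0 - g) * x for g, x in zip(gates, frame)] for frame in features]
    # Assumption: identity does not change over time, so pool over frames.
    num_frames = len(features)
    visage_style = [sum(col) / num_frames for col in zip(*residual)]
    return content, visage_style


# Toy input: 4 frames of 3-dimensional random features.
random.seed(0)
T, D = 4, 3
features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
gate_logits = [2.0, 0.0, -2.0]  # mostly content, split evenly, mostly visage

content, visage_style = select_speech_visage(features, gate_logits)
```

In a trained model the gates would be learned and would likely depend on the input; fixed values are used here only to make the complementary split easy to inspect.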
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
