DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

Jeongsoo Choi; Joanna Hong; Yong Man Ro

arXiv:2308.07787 · cs.SD · August 16, 2023

TL;DR

This paper introduces DiffV2S, a diffusion-based video-to-speech synthesis model that uses vision-guided speaker embeddings extracted without audio during inference, achieving state-of-the-art results.

Contribution

The paper proposes a novel vision-guided speaker embedding extractor using self-supervised learning and prompt tuning, enabling speech synthesis solely from visual input.

Findings

01 Achieves state-of-the-art performance in video-to-speech synthesis

02 Maintains phoneme details and speaker identity in generated speech

03 Does not require audio information during inference

Abstract

Recent research has demonstrated impressive results in video-to-speech synthesis, which involves reconstructing speech solely from visual input. However, previous works have struggled to synthesize speech accurately because the model lacks sufficient guidance to infer the correct content with the appropriate sound. To resolve this issue, they have adopted an extra speaker embedding, extracted from reference audio, as speaking-style guidance. Nevertheless, it is not always possible to obtain the audio corresponding to the video input, especially at inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and a prompt tuning technique. In doing so, rich speaker embedding information can be produced solely from the input visual information, and the extra audio information is not…
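The prompt tuning idea named in the abstract can be illustrated with a toy example. The sketch below is not the authors' architecture: `frozen_encoder` is a hypothetical stand-in (a fixed linear projection followed by mean pooling) for a self-supervised pre-trained model, the visual tokens and target speaker embedding are made-up numbers, and every name is an assumption chosen for illustration. It shows only the general mechanism: learnable prompt vectors are prepended to the visual feature sequence, and gradient descent updates the prompts while the encoder weights stay fixed.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def frozen_encoder(W, tokens):
    """Toy frozen model (assumption): project each token with W, then mean-pool."""
    projected = [matvec(W, t) for t in tokens]
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]

def loss_fn(emb, target):
    """Squared error between the pooled embedding and a target embedding."""
    return sum((e - t) ** 2 for e, t in zip(emb, target))

def tune_prompts(W, prompts, visual_tokens, target, lr=1.0, steps=200):
    """Gradient descent on the prompt vectors only; W is read but never updated."""
    prompts = [p[:] for p in prompts]  # work on a copy of the initial prompts
    dim_in = len(prompts[0])
    for _ in range(steps):
        tokens = prompts + visual_tokens  # prepend prompts to the visual sequence
        n = len(tokens)
        emb = frozen_encoder(W, tokens)
        # d(loss)/d(emb) = 2 * (emb - target)
        g_emb = [2 * (e - t) for e, t in zip(emb, target)]
        # Each prompt contributes (W @ p) / n to emb, so d(loss)/d(p) = W^T g_emb / n
        g_prompt = [
            sum(W[i][j] * g_emb[i] for i in range(len(W))) / n
            for j in range(dim_in)
        ]
        for p in prompts:
            for j in range(dim_in):
                p[j] -= lr * g_prompt[j]
    return prompts

# Hypothetical toy data: 3-dim visual tokens, 2-dim speaker embedding.
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]            # frozen weights
visual_tokens = [[0.2, 0.1, 0.0], [0.4, -0.3, 0.1], [0.0, 0.5, 0.2]]
target = [1.0, -1.0]                                # made-up target embedding
init_prompts = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # two learnable prompts

before = loss_fn(frozen_encoder(W, init_prompts + visual_tokens), target)
tuned = tune_prompts(W, init_prompts, visual_tokens, target)
after = loss_fn(frozen_encoder(W, tuned + visual_tokens), target)
print(f"loss before tuning: {before:.4f}, after tuning: {after:.2e}")
```

In this toy setting the loss falls because the prompts alone shift the pooled embedding toward the target. A real system would replace the linear stand-in with a deep pre-trained network and backpropagate through it with its weights frozen.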

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing