More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech

Michael Hassid; Michelle Tadmor Ramanovich; Brendan Shillingford; Miaosen Wang; Ye Jia; Tal Remez

arXiv:2111.10139 · cs.CV · March 25, 2022

TL;DR

This paper introduces VDTTS, a visually-driven TTS model that uses video frames to generate speech with natural prosody and synchronization to the input video, improving realism in in-the-wild scenarios.

Contribution

The paper presents VDTTS, a novel TTS approach that incorporates visual information to enhance prosody and synchronization, especially in challenging real-world settings.

Findings

01. Produces well-synchronized speech, approaching the video-speech synchronization quality of the ground truth

02. Demonstrates robustness to speaker ID swapping

03. Generates natural prosody on in-the-wild video content

Abstract

In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS, unlike plain TTS models, to generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented on the project page.

Code & Models

Repositories

galaxycong/styledubber
pytorch

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization