More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech
Michael Hassid, Michelle Tadmor Ramanovich, Brendan Shillingford, Miaosen Wang, Ye Jia, Tal Remez

TL;DR
This paper introduces VDTTS, a visually-driven TTS model that uses video frames to generate speech with natural prosody and synchronization to the input video, improving realism in in-the-wild scenarios.
Contribution
The paper presents VDTTS, a novel TTS approach that incorporates visual information to enhance prosody and synchronization, especially in challenging real-world settings.
Findings
Produces well-synchronized speech approaching ground-truth quality
Demonstrates robustness to speaker ID swapping
Generates natural prosody with in-the-wild video content
Abstract
In this paper we present VDTTS, a Visually-Driven Text-to-Speech model. Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text, and generates speech that matches the video signal. We demonstrate how this allows VDTTS to, unlike plain TTS models, generate speech that not only has prosodic variations like natural pauses and pitch, but is also synchronized to the input video. Experimentally, we show our model produces well-synchronized outputs, approaching the video-speech synchronization quality of the ground-truth, on several challenging benchmarks including "in-the-wild" content from VoxCeleb2. Supplementary demo videos demonstrating video-speech synchronization, robustness to speaker ID swapping, and prosody are presented on the project page.
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Video Analysis and Summarization
