VisualSpeech: Enhancing Prosody Modeling in TTS Using Video
Shumin Que, Anton Ragni

TL;DR
This paper introduces VisualSpeech, a model that integrates visual and textual cues to improve prosody prediction in text-to-speech synthesis, resulting in more expressive speech outputs.
Contribution
The paper presents a novel approach that combines visual context with text for prosody modeling in TTS. Prior work predicted prosody from text and speech and left visual context under-utilized.
Findings
Visual features improve prosodic modeling.
Enhanced expressiveness in synthesized speech.
Empirical results support both gains.
Abstract
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates visual and textual information for improving prosody generation in TTS. Empirical results indicate that incorporating visual features improves prosodic modeling, enhancing the expressiveness of the synthesized speech. Audio samples are available at https://ariameetgit.github.io/VISUALSPEECH-SAMPLES/.
Taxonomy
Topics: Speech and dialogue systems
