VisualSpeech: Enhancing Prosody Modeling in TTS Using Video

Shumin Que; Anton Ragni

arXiv:2501.19258 · cs.CL · August 19, 2025

TL;DR

This paper introduces VisualSpeech, a model that integrates visual and textual cues to improve prosody prediction in text-to-speech synthesis, resulting in more expressive speech outputs.

Contribution

The paper presents a novel approach that combines visual context with text to improve prosody modeling in TTS, a combination that was under-explored in prior work.

Findings

01. Visual features improve prosodic modeling.

02. Synthesized speech becomes more expressive.

03. Empirical results support both improvements.

Abstract

Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates visual and textual information for improving prosody generation in TTS. Empirical results indicate that incorporating visual features improves prosodic modeling, enhancing the expressiveness of the synthesized speech. Audio samples are available at https://ariameetgit.github.io/VISUALSPEECH-SAMPLES/.

Taxonomy

Topics: Speech and dialogue systems