I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception

Jiawei Zhang; Tian-Hao Zhang; Jun Wang; Jiaran Gao; Xinyuan Qian; Xu-Cheng Yin

arXiv:2411.13314·cs.SD·September 4, 2025

Open Access

TL;DR

This paper introduces I2TTS, a novel multi-modal TTS system that incorporates visual scene prompts and reverberation refinement to produce spatially immersive and contextually accurate speech synthesis for virtual environments.

Contribution

The paper presents a new scene prompt encoder and reverberation refinement technique that enhance spatial perception in speech synthesis, advancing the realism of virtual auditory experiences.

Findings

01 Achieves high-quality scene and spatial matching

02 Maintains speech naturalness while enhancing immersion

03 Demonstrates effectiveness in virtual reality contexts

Abstract

Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous text-to-speech (TTS) work has focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, it overlooks the growing emphasis on the spatial perception of synthesized speech, which can provide an immersive experience in gaming and virtual reality. To address this gap, we present a novel multi-modal TTS approach, Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the…
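To make the link between reverberation and perceived space concrete, here is a minimal Python sketch. It is not the I2TTS method, whose details are not given in this summary. It uses a standard signal-processing approximation: a room impulse response is modeled as exponentially decaying noise, and "dry" speech is convolved with it. The function names `synthetic_rir` and `add_reverb`, the sample rate, and the RT60 values are all assumptions chosen for illustration.

```python
import numpy as np

def synthetic_rir(rt60_s, sample_rate=16000, seed=0):
    """Toy room impulse response (hypothetical helper, not from the paper).

    White noise shaped by an exponential envelope whose amplitude falls
    by 60 dB at t = rt60_s, a common rough model of diffuse reverberation.
    """
    rng = np.random.default_rng(seed)
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    # exp(-k * rt60) = 1e-3 (that is, -60 dB)  =>  k = ln(1000) / rt60
    envelope = np.exp(-np.log(1000.0) * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct sound path
    return rir / np.max(np.abs(rir))

def add_reverb(dry, rir):
    """Convolve a dry signal with an impulse response and peak-normalize."""
    wet = np.convolve(dry, rir)
    return wet / (np.max(np.abs(wet)) + 1e-12)

# Stand-in for synthesized speech: one second of a 220 Hz tone.
sr = 16000
dry = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)

# Assumed RT60 values: a small room versus a large hall.
small_room = add_reverb(dry, synthetic_rir(0.3, sr))
large_hall = add_reverb(dry, synthetic_rir(1.5, sr))
```

A longer RT60 produces a longer decaying tail, which listeners tend to associate with a larger or more reflective space. A system that conditions on a scene image would need to choose or predict such acoustic properties from the visual input; how I2TTS does so is described in the full paper rather than in this summary.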

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Human Motion and Animation