I2TTS: Image-indicated Immersive Text-to-speech Synthesis with Spatial Perception
Jiawei Zhang, Tian-Hao Zhang, Jun Wang, Jiaran Gao, Xinyuan Qian, Xu-Cheng Yin

TL;DR
This paper introduces I2TTS, a novel multi-modal TTS system that incorporates visual scene prompts and reverberation refinement to produce spatially immersive and contextually accurate speech synthesis for virtual environments.
Contribution
The paper presents a new scene prompt encoder and a reverberation classification and refinement technique that together enhance spatial perception in speech synthesis, advancing the realism of virtual auditory experiences.
Findings
Achieves high-quality scene and spatial matching
Maintains speech naturalness while enhancing immersion
Demonstrates effectiveness in virtual reality contexts
Abstract
Controlling the style and characteristics of speech synthesis is crucial for adapting the output to specific contexts and user requirements. Previous text-to-speech (TTS) work has focused primarily on the technical aspects of producing natural-sounding speech, such as intonation, rhythm, and clarity. However, it overlooks the growing emphasis on the spatial perception of synthesized speech, which can provide an immersive experience in gaming and virtual reality. To address this, we present a novel multi-modal TTS approach, Image-indicated Immersive Text-to-speech Synthesis (I2TTS). Specifically, we introduce a scene prompt encoder that integrates visual scene prompts directly into the synthesis pipeline to control the speech generation process. Additionally, we propose a reverberation classification and refinement technique that adjusts the…
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Human Motion and Animation
