Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
Shuwei He, Rui Liu

TL;DR
This paper introduces MS2KU-VTTS, a novel approach that integrates multi-source spatial knowledge such as depth, speaker position, and semantics to improve immersive visual text-to-speech synthesis, surpassing existing methods.
Contribution
The paper proposes a new multi-source spatial knowledge understanding scheme with a serial interaction mechanism for immersive VTTS, effectively combining diverse environmental cues.
Findings
Outperforms existing baselines in immersive speech generation
Effectively integrates multi-source spatial knowledge
Enhances the immersive experience in VTTS
Abstract
Visual Text-to-Speech (VTTS) aims to use an environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge such as depth, speaker position, and environmental semantics. To address this gap, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS2KU-VTTS. Specifically, we first treat the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of…
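The abstract is truncated here, so the exact fusion rule is not available. As a rough illustration of what "dynamically integrated based on the respective contributions" could look like, the sketch below weights each supplementary feature vector by its scaled dot-product similarity to the dominant one and adds the weighted sum back to the dominant vector. The function names (`softmax`, `dot`, `fuse_sources`), the similarity score, and the toy feature vectors are all assumptions made for illustration; this is not the MS2KU-VTTS architecture or its serial interaction mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_sources(dominant, supplementary):
    """Hypothetical contribution-weighted fusion (an assumption, not the paper's method).

    dominant: feature vector for the dominant source (e.g. RGB).
    supplementary: dict mapping a source name to a vector of the same length.
    Returns the fused vector and the per-source weights.
    """
    names = list(supplementary)
    scale = math.sqrt(len(dominant))
    # Score each supplementary source by its similarity to the dominant source.
    scores = [dot(dominant, supplementary[name]) / scale for name in names]
    weights = softmax(scores)
    # Residual-style fusion: dominant vector plus weighted supplementary vectors.
    fused = list(dominant)
    for name, weight in zip(names, weights):
        fused = [f + weight * s for f, s in zip(fused, supplementary[name])]
    return fused, dict(zip(names, weights))


# Toy vectors standing in for encoder outputs (made-up values).
rgb = [1.0, 0.0]
others = {
    "depth": [1.0, 0.0],
    "speaker_position": [0.0, 1.0],
    "caption": [0.0, 0.0],
}
fused, weights = fuse_sources(rgb, others)
print(weights)
```

In this toy setup the depth vector is the most similar to the RGB vector, so it receives the largest weight. A learned model would replace the fixed dot-product score with trainable projections.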
Taxonomy
Topics: Speech and dialogue systems · Human Motion and Animation · Subtitles and Audiovisual Media
