Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
Rui Liu, Shuwei He, Yifan Hu, Haizhou Li

TL;DR
This paper introduces M2SE-VTTS, a novel multi-modal and multi-scale approach that leverages RGB and Depth images to improve spatial environment understanding for immersive visual text-to-speech synthesis, outperforming existing methods.
Contribution
The paper proposes a new multi-modal, multi-scale framework that integrates RGB and Depth data with local-global spatial modeling for enhanced environmental speech synthesis.
Findings
Outperforms advanced baselines in objective evaluations
Achieves superior subjective quality in speech synthesis
Effectively models local and global spatial interactions
Abstract
Visual Text-to-Speech (VTTS) aims to take an environmental image as the prompt to synthesize reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, previous works have ignored local information and depth image information, both of which are crucial for understanding the spatial environment. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal component takes both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale component models local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth…
Taxonomy
Topics: Speech and dialogue systems · Human Motion and Animation · Video Analysis and Summarization
Methods: ADaptive gradient method with the OPTimal convergence rate
