Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Rui Liu; Shuwei He; Yifan Hu; Haizhou Li

arXiv:2412.11409·cs.CV·January 16, 2025

TL;DR

This paper introduces M2SE-VTTS, a novel multi-modal and multi-scale approach that leverages RGB and Depth images to improve spatial environment understanding for immersive visual text-to-speech synthesis, outperforming existing methods.

Contribution

The paper proposes a new multi-modal, multi-scale framework that integrates RGB and Depth data with local-global spatial modeling for enhanced environmental speech synthesis.

Findings

01

Outperforms advanced baselines in objective evaluations

02

Achieves superior subjective quality in speech synthesis

03

Effectively models local and global spatial interactions

Abstract

Visual Text-to-Speech (VTTS) aims to take an environmental image as the prompt to synthesize reverberant speech for the spoken content. The challenge of this task lies in understanding the spatial environment from the image. Many attempts have been made to extract global spatial visual information from the RGB space of a spatial image. However, local and depth image information is crucial for understanding the spatial environment, and previous works have ignored it. To address these issues, we propose a novel multi-modal and multi-scale spatial environment understanding scheme to achieve immersive VTTS, termed M2SE-VTTS. The multi-modal component takes both the RGB and Depth spaces of the spatial image to learn more comprehensive spatial information, and the multi-scale component models the local and global spatial knowledge simultaneously. Specifically, we first split the RGB and Depth…
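
To make the "multi-modal, multi-scale" idea concrete, the following is a toy sketch only, not the M2SE-VTTS pipeline. It assumes a tiny image given as nested Python lists, pairs each RGB pixel with its depth value, and summarizes the result at two scales: per-patch channel means (local) and whole-image channel means (global). The function names, the patch size, and the use of simple averaging are all illustrative assumptions; the paper's actual feature extractors are not described in the text available here.

```python
from statistics import fmean


def stack_rgbd(rgb, depth):
    """Attach each depth value to its RGB pixel, giving 4-channel pixels."""
    return [[list(px) + [d] for px, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, depth)]


def channel_means(pixels):
    """Average each channel over a flat list of pixels."""
    return [fmean(channel) for channel in zip(*pixels)]


def local_global_features(rgb, depth, patch=2):
    """Return (local, global) descriptors for an RGB image plus depth map.

    Hypothetical helper: `local` holds one 4-value mean per patch,
    `global` is a single 4-value mean over the whole image.
    """
    rgbd = stack_rgbd(rgb, depth)
    height, width = len(rgbd), len(rgbd[0])
    local = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            block = [rgbd[y][x]
                     for y in range(top, min(top + patch, height))
                     for x in range(left, min(left + patch, width))]
            local.append(channel_means(block))
    global_feat = channel_means([px for row in rgbd for px in row])
    return local, global_feat
```

In this toy setting, the local descriptors keep region-level differences (for example, a near surface on one side and a far one on the other) that the single global descriptor averages away, which is the intuition behind modeling both scales.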

Code & Models

Repositories

ai-s2-lab/m2se-vtts
none · Official

Models

🤗 he-shuwei/M2SE-VTTS
model · ♡ 1

Videos

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Taxonomy

Topics: Speech and dialogue systems · Human Motion and Animation · Video Analysis and Summarization

Methods: ADaptive gradient method with the OPTimal convergence rate