Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

Shuwei He; Rui Liu

arXiv:2410.14101 · cs.SD · December 24, 2024

TL;DR

This paper introduces MS2KU-VTTS, an approach to immersive visual text-to-speech synthesis that integrates multi-source spatial knowledge (depth, speaker position, and environmental semantics) alongside the RGB image, and reports outperforming existing baselines.

Contribution

The paper proposes a new multi-source spatial knowledge understanding scheme with a serial interaction mechanism for immersive VTTS, effectively combining diverse environmental cues.

Findings

01. Outperforms existing baselines in immersive speech generation

02. Effectively integrates multi-source spatial knowledge

03. Enhances the immersive experience in VTTS

Abstract

Visual Text-to-Speech (VTTS) aims to take the environmental image as the prompt to synthesize reverberant speech for the spoken content. Previous works focus on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address these issues, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS2KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and Gemini-generated semantic captions as supplementary sources. Afterwards, we propose a serial interaction mechanism to effectively integrate both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on the respective contributions of…
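
The abstract names two steps, a serial interaction between the dominant and supplementary sources and a contribution-based dynamic integration, without giving their exact form here. The sketch below is only a toy illustration of that general pattern, not the authors' implementation: the sigmoid gate, the softmax weighting, the function names `serial_interaction` and `dynamic_fusion`, and the three-dimensional feature vectors are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def serial_interaction(dominant, supplementary):
    """Visit supplementary sources one after another (assumed scheme).

    The running state starts from the dominant (RGB) feature and is
    blended with each supplementary feature through a sigmoid gate.
    Returns the state recorded after each step.
    """
    state = list(dominant)
    refined = []
    for source in supplementary:
        # Hypothetical gate: similarity between current state and source.
        gate = 1.0 / (1.0 + math.exp(-dot(state, source)))
        state = [(1.0 - gate) * s + gate * x for s, x in zip(state, source)]
        refined.append(list(state))
    return refined


def dynamic_fusion(features, query):
    """Weight each feature by its softmax-normalized score against `query`
    (a stand-in for "respective contributions") and sum the results."""
    weights = softmax([dot(f, query) for f in features])
    fused = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(len(query))
    ]
    return fused, weights


# Toy feature vectors standing in for encoder outputs (made-up values).
rgb = [1.0, 0.0, 0.5]
depth = [0.2, 0.9, 0.1]
position = [0.0, 0.3, 0.8]
caption = [0.5, 0.5, 0.5]

refined = serial_interaction(rgb, [depth, position, caption])
fused, weights = dynamic_fusion(refined, rgb)
print(weights)
print(fused)
```

In this toy version the fused vector is a convex combination of the per-step states, so sources that score higher against the RGB query contribute more. The paper's actual interaction and weighting modules may differ substantially.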

Code & Models

Repositories

ms2ku-vtts/ms2ku-vtts
none · Official

Taxonomy

Topics: Speech and dialogue systems · Human Motion and Animation · Subtitles and Audiovisual Media

Methods: Focus