VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

Chengyuan Ma; Jiawei Jin; Ruijie Xiong; Chunxiang Jin; Canxiang Yan; Wenming Yang

arXiv:2602.02591·cs.SD·February 4, 2026

Open Access

TL;DR

VividVoice is a unified generative framework for scene-aware, visually-driven speech synthesis. It aligns visual scenes with speaker timbre and environmental acoustics, addressing data scarcity and modality decoupling to produce more immersive audio.

Contribution

The paper presents VividVoice, a unified generative framework, together with a large-scale hybrid multimodal dataset (Vivid-210K) and a new alignment module (D-MSVA) for scene-aware speech synthesis.

Findings

01. Outperforms baseline models in audio fidelity and clarity

02. Achieves fine-grained visual-to-audio alignment

03. Demonstrates strong multimodal consistency

Abstract

We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Music Technology and Sound Studies