VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task

Yuyue Wang; Xin Cheng; Yihan Wu; Xihua Wang; Jinchuan Tian; Ruihua Song

arXiv:2511.22229·cs.MM·December 1, 2025

TL;DR

VSpeechLM is a novel model that leverages speech large language models and a text-video aligner to generate high-quality, lip-synchronized speech from videos, improving over existing VisualTTS methods.

Contribution

The paper introduces VSpeechLM, a new Visual Speech Language Model that builds on a SpeechLLM and incorporates a text-video aligner to improve lip-synchronized speech generation.

Findings

01. Outperforms previous methods in quality, similarity, and synchronization

02. Effectively captures fine-grained phoneme-lip movement alignment

03. Generates more natural and synchronized speech in VisualTTS tasks

Abstract

The task of Visual Text-to-Speech (VisualTTS), also known as video dubbing, aims to generate speech synchronized with the lip movements in an input video, in addition to being consistent with the content of the input text and cloning the timbre of a reference speech. Existing VisualTTS models typically adopt lightweight architectures and design a specialized module for each of these goals, yet their speech quality is unsatisfactory because of limited model capacity and the limited data available for VisualTTS. Recently, speech large language models (SpeechLLM) have shown a robust ability to generate high-quality speech, but little work has been done to leverage temporal cues from the video input when generating lip-synchronized speech. To generate speech that is both high-quality and lip-synchronized in VisualTTS tasks, we propose a novel Visual Speech Language Model, VSpeechLM, built upon a SpeechLLM. To…

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis