DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

Neha Sahipjohn; Ashishkumar Gudmalwar; Nirmesh Shah; Pankaj Wasnik; Rajiv Ratn Shah

arXiv:2406.08802·eess.AS·June 14, 2024·1 cites

Open Access

TL;DR

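DubWise is a multimodal LLM-based TTS system for dubbing. It controls the duration of synthesized speech so that it stays in sync with the lip movements in a reference video, even when the spoken text differs or is in another language. It does this by applying cross-modal attention in a pre-trained GPT-based TTS and feeding it video tokens from a duration controller network.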

Contribution

The paper presents a novel multimodal LLM-based TTS method that aligns speech with lip movements across languages, improving lip sync and naturalness in dubbing applications.

Findings

01. Effective lip sync in cross-lingual dubbing scenarios.

02. Improved naturalness over state-of-the-art methods.

03. Successful application on Lip2Wav-Chemistry and LRS2 datasets.

Abstract

Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose a novel method, DubWise Multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS), which can control the duration of synthesized speech so that it aligns well with the speaker's lip movements in the reference video, even when the spoken text is different or in a different language. To accomplish this, we propose to utilize cross-modal attention techniques in a pre-trained GPT-based TTS. We combine linguistic tokens from text, speaker identity tokens via a voice cloning network, and video tokens via a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. Also, the proposed method achieves improved lip sync and naturalness compared to state-of-the-art (SOTA) methods for the same language but different text (i.e.,…
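The abstract mentions cross-modal attention between token streams but this page gives no architectural detail. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention in plain Python, where text-token embeddings act as queries over video-token embeddings. It is not the DubWise architecture: the function names, the toy 2-dimensional embeddings, and the choice of text as queries and video as keys/values are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's exact layer).

    Each query vector (here: a text-token embedding) attends over keys and
    values that come from another modality (here: video-token embeddings).
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical toy embeddings; real systems use learned, high-dimensional ones.
text_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_tokens = [[1.0, 0.0], [0.0, 1.0]]

fused = cross_modal_attention(text_tokens, video_tokens, video_tokens)
for row in fused:
    print([round(x, 3) for x in row])
```

Each output row is a video-informed representation of one text token. In a real model the queries, keys, and values would first pass through learned projection matrices, and the attention would typically be multi-headed.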

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research