InstructDubber: Instruction-based Alignment for Zero-shot Movie Dubbing
Zhedong Zhang, Liang Li, Gaoxiang Cong, Chunshan Liu, Yuhan Gao, Xiaowan Wang, Tao Gu, Yuankai Qi

TL;DR
InstructDubber introduces an instruction-based approach for zero-shot movie dubbing that leverages multimodal large language models to improve lip synchronization and emotion alignment, overcoming visual domain limitations.
Contribution
It proposes a novel instruction-based framework utilizing large language models for robust, zero-shot movie dubbing with improved lip-sync and emotion-prosody alignment.
Findings
Outperforms state-of-the-art methods on major benchmarks.
Effective in both in-domain and zero-shot scenarios.
Improves lip synchronization and emotion alignment quality.
Abstract
Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character's visual performance. However, existing alignment approaches based on visual features face two key limitations: (1) they rely on complex, handcrafted visual preprocessing pipelines, including facial landmark detection and feature extraction; and (2) they generalize poorly to unseen visual domains, often resulting in degraded alignment and dubbing quality. To address these issues, we propose InstructDubber, a novel instruction-based alignment dubbing method for both robust in-domain and zero-shot movie dubbing. Specifically, we first feed the video, script, and corresponding prompts into a multimodal large language model to generate natural language dubbing instructions regarding the speaking rate and emotion…
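The instruction-generation step described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the prompt wording, the `DubbingInstruction` container, the `build_prompt` and `parse_instruction` helpers, and the "Field: value" reply layout are all assumptions made for illustration, and the call to the multimodal model itself is left out because the abstract does not specify it.

```python
from dataclasses import dataclass


@dataclass
class DubbingInstruction:
    """Hypothetical container for the two attributes the abstract names."""
    speaking_rate: str
    emotion: str


def build_prompt(script: str) -> str:
    """Assemble an illustrative text prompt to send alongside the video clip.

    The wording here is an assumption, not the prompt used in the paper.
    """
    return (
        "Watch the clip and read the script line below.\n"
        f"Script: {script}\n"
        "Describe how the line should be dubbed, one field per line:\n"
        "Speaking rate: <description>\n"
        "Emotion: <description>"
    )


def parse_instruction(reply: str) -> DubbingInstruction:
    """Parse a model reply written in the assumed 'Field: value' layout."""
    fields = {}
    for line in reply.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return DubbingInstruction(
        speaking_rate=fields.get("speaking rate", "unspecified"),
        emotion=fields.get("emotion", "unspecified"),
    )
```

In this sketch the free-text descriptions are kept as strings, since the abstract describes the instructions as natural language rather than numeric controls.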
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
