FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Gaoxiang Cong; Liang Li; Jiadong Pan; Zhedong Zhang; Amin Beheshti; Anton van den Hengel; Yuankai Qi; Qingming Huang

arXiv:2505.01263 · cs.MM · August 26, 2025

TL;DR

FlowDubber leverages LLM-based semantic-aware learning and flow matching to improve movie dubbing by enhancing lip-sync, pronunciation, and acoustic quality, outperforming existing methods on key benchmarks.

Contribution

The paper introduces FlowDubber, a novel LLM-based flow matching architecture that integrates semantic-aware learning, dual contrastive alignment, and voice-enhanced flow matching for superior dubbing quality.

Findings

01. Outperforms state-of-the-art methods on primary benchmarks.

02. Achieves high-quality audio-visual synchronization and pronunciation.

03. Enhances acoustic quality through flow-based voice enhancement.

Abstract

Movie dubbing aims to convert scripts into speech that aligns with a given movie clip in both timing and emotion while preserving the vocal timbre of a brief reference audio. Existing methods focus primarily on reducing the word error rate and neglect lip-sync and acoustic quality. To address these issues, we propose FlowDubber, a large language model (LLM) based flow matching architecture for dubbing. It achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning, and it achieves better acoustic quality than previous works via the proposed voice-enhanced flow matching. First, we introduce Qwen2.5 as the LLM backbone to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM…
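
The abstract names flow matching as the mechanism behind the acoustic-quality gains, but the truncated text above does not give its formulation. As a rough illustration only, the sketch below shows the generic straight-line conditional flow matching objective and a fixed-step Euler sampler in plain Python. It is not FlowDubber's voice-enhanced variant: the function names, the straight-line path, and the Euler integrator are all assumptions made for this example, and `model` stands in for any callable that predicts a velocity from a point and a time.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x1, rng):
    """One-sample estimate of the conditional flow matching loss.

    model(x_t, t) is a hypothetical velocity predictor (assumption).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = model(x_t, t)
    v_true = target_velocity(x0, x1)
    # Mean squared error between predicted and target velocity.
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(model, x0, steps=10):
    """Integrate dx/dt = model(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, `flow_matching_loss` would be averaged over many data samples and minimized with respect to the model's parameters; at inference, `euler_sample` turns a noise vector into a sample by following the learned velocity field. A real dubbing system would operate on conditioned acoustic features such as mel-spectrogram frames rather than the short lists used here.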

Taxonomy

Topics: Music and Audio Processing

Methods: Focus · Contrastive Language-Image Pre-training · ALIGN