FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

Jiaxuan Liu; Yang Xiang; Han Zhao; Xiangang Li; Zhenhua Ling

arXiv:2601.14777·cs.CV·January 22, 2026

TL;DR

FunCineForge introduces a comprehensive dataset and a novel model for zero-shot movie dubbing, significantly improving lip sync, speech quality, and emotional expressiveness across diverse cinematic scenes.

Contribution

FunCineForge provides the first large-scale Chinese dubbing dataset with rich annotations, along with a new MLLM-based model for dubbing diverse cinematic scenes.

Findings

01 Outperforms state-of-the-art methods in audio quality and lip sync

02 Constructs the first Chinese TV dubbing dataset with rich annotations

03 Is effective across monologue, narration, dialogue, and multi-speaker scenes

Abstract

Movie dubbing is the task of synthesizing speech from scripts conditioned on video scenes, requiring accurate lip sync, faithful timbre transfer, and proper modeling of character identity and emotion. However, existing methods face two major limitations: (1) high-quality multimodal dubbing datasets are limited in scale, suffer from high word error rates, contain sparse annotations, rely on costly manual labeling, and are restricted to monologue scenes, all of which hinder effective model training; (2) existing dubbing models rely solely on the lip region to learn audio-visual alignment, which limits their applicability to complex live-action cinematic scenes, and exhibit suboptimal performance in lip sync, speech quality, and emotional expressiveness. To address these issues, we propose FunCineForge, which comprises an end-to-end production pipeline for large-scale dubbing datasets and…

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face Recognition and Analysis