TL;DR
This paper introduces a novel single-model video dubbing method using a joint audio-visual diffusion model and lightweight LoRA conditioning, enabling high-quality, synchronized dubbing with preserved speaker identity.
Contribution
The work presents a new approach that adapts a foundational audio-visual diffusion model for video dubbing using LoRA, simplifying the pipeline and improving robustness.
Findings
Produces high-quality dubbed videos with better lip sync than existing dubbing pipelines.
Preserves speaker identity and visual fidelity.
Outperforms existing dubbing pipelines in robustness.
Abstract
Audio-Visual Foundation Models, which are pretrained to jointly generate sound and visual content, have recently shown an unprecedented ability to model multi-modal generation and editing, opening new opportunities for downstream tasks. Among these tasks, video dubbing could greatly benefit from such priors, yet most existing solutions still rely on complex, task-specific pipelines that struggle in real-world settings. In this work, we introduce a single-model approach that adapts a foundational audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. The LoRA enables the model to condition on an input audio-video while jointly generating translated audio and synchronized facial motion. To train this LoRA, we leverage the generative model itself to synthesize paired multilingual videos of the same speaker. Specifically, we generate multilingual videos with language…
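The abstract does not spell out how the LoRA is attached to the diffusion model. As a rough illustration of the general technique only, the sketch below shows the standard low-rank adaptation idea on a single linear layer: the pretrained weight stays frozen, and a small trainable update `B @ A` is added on top. The class name `LoRALinear`, the `rank` and `alpha` parameters, and the initialization are illustrative assumptions, not the authors' implementation.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Hypothetical linear layer with a low-rank adapter (generic LoRA sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # A projects down to `rank` dimensions; small random initialization.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B projects back up; zero initialization so the adapter starts as a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # frozen path: W x
        delta = matvec(self.B, matvec(self.A, x))   # adapter path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model's output exactly, and only `rank * (d_in + d_out)` parameters are trained per layer instead of `d_in * d_out`. That is the sense in which such an adapter is "lightweight".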
