Identity-Preserving Video Dubbing Using Motion Warping
Runzhen Liu, Qinjie Lin, Yunfei Liu, Lijian Lin, Ye Zhu, Yu Li, Chuhua Xian, Fa-Ting Hong

TL;DR
This paper introduces IPTalker, a transformer-based framework for video dubbing that achieves high-fidelity identity preservation and lip-sync accuracy by dynamically aligning audio cues with reference visuals and refining the generated videos.
Contribution
The paper presents a novel transformer-based alignment mechanism combined with motion warping and refinement strategies to improve identity preservation in video dubbing.
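To make the idea of audio-to-visual alignment concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which audio feature vectors act as queries over reference visual tokens. It is not the IPTalker implementation: the function names, the toy feature values, and the choice of plain single-head attention are all assumptions made for illustration, and the paper's actual architecture may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(audio_queries, visual_keys, visual_values):
    """Scaled dot-product cross-attention (illustrative, single head).

    Each audio query vector attends over all reference visual tokens and
    returns a weighted average of their value vectors.
    """
    dim = len(visual_keys[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in audio_queries:
        # Similarity between this audio frame and every visual token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in visual_keys]
        w = softmax(scores)
        # Weighted average of the visual value vectors.
        out = [sum(wj * v[d] for wj, v in zip(w, visual_values))
               for d in range(len(visual_values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Hypothetical toy data: 2 audio frames, 3 reference visual tokens, dim 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
aligned, attn = cross_attention(audio, keys, values)
```

In this toy setup, each audio frame puts most of its attention weight on the visual token whose key points in the same direction, so the aligned output is dominated by that token's value vector.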
Findings
Outperforms existing methods in realism and lip-sync accuracy
Achieves superior identity retention in generated videos
Establishes new state-of-the-art in identity-consistent video dubbing
Abstract
Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific features, largely because they do not effectively capture the nuanced interplay between audio cues and the visual attributes of the reference identity. As a result, the generated outputs frequently lack fidelity in reproducing the unique textural and structural details of the reference identity. To address these limitations, we propose IPTalker, a novel and robust framework for video dubbing that achieves seamless alignment between driving audio and reference identity while ensuring both lip-sync accuracy and high-fidelity identity preservation. At the core of IPTalker is a transformer-based alignment mechanism designed to dynamically capture and model…
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Human Motion and Animation
