SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild
Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang

TL;DR
SyncAnyone introduces a two-stage framework that combines diffusion-based masked mouth inpainting with a mask-free tuning stage to improve lip-sync accuracy and visual fidelity for in-the-wild video dubbing, reducing artifacts while preserving identity.
Contribution
The paper presents a novel two-stage learning approach that enhances lip-syncing accuracy and visual quality by integrating diffusion models with a mask-free tuning pipeline.
Findings
Achieves state-of-the-art visual quality and temporal coherence.
Maintains high identity preservation in challenging scenarios.
Effectively reduces artifacts and background inconsistencies.
Abstract
High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate,…
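To make the mask-based training strategy described in the abstract concrete, here is a minimal sketch of how a corrupted input clip might be built by blanking out a mouth region in every frame. It is an illustration only, not the authors' pipeline: the function name `mask_mouth_region`, the fixed rectangular box, the grayscale nested-list frame format, and the zero fill value are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical frame format: one grayscale frame as rows of pixel values.
Frame = List[List[float]]


def mask_mouth_region(
    frames: List[Frame],
    box: Tuple[int, int, int, int],
    fill: float = 0.0,
) -> List[Frame]:
    """Return copies of `frames` with the box (top, left, bottom, right)
    overwritten by `fill`. Bottom and right are exclusive."""
    top, left, bottom, right = box
    masked = []
    for frame in frames:
        # Copy each row so the original clip is left untouched.
        new_frame = [row[:] for row in frame]
        for y in range(top, bottom):
            for x in range(left, right):
                new_frame[y][x] = fill
        masked.append(new_frame)
    return masked


# Toy clip: 2 frames of 4x4 pixels, all set to 1.0.
clip = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]

# Assumed mouth box covering rows 2-3 and columns 1-2 of every frame.
corrupted = mask_mouth_region(clip, box=(2, 1, 4, 3))
```

Because the same box is erased in every frame, the model sees no mouth pixels anywhere in the clip. That removal of spatiotemporal context is what the abstract identifies as the weakness of mask-based training.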
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
