SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

Xindi Zhang; Dechao Meng; Steven Xiao; Qi Wang; Peng Zhang; Bang Zhang

arXiv:2512.21736 · cs.CV · February 9, 2026

Open Access

TL;DR

SyncAnyone is a two-stage framework that combines diffusion-based masked mouth inpainting with mask-free tuning to improve lip-sync accuracy and visual fidelity in in-the-wild video dubbing, reducing artifacts while preserving identity.

Contribution

The paper presents a two-stage learning approach that improves lip-sync accuracy and visual quality by pairing a diffusion-based video transformer, trained for masked mouth inpainting, with a mask-free tuning pipeline.

Findings

01. Achieves state-of-the-art visual quality and temporal coherence.

02. Maintains high identity preservation in challenging scenarios.

03. Effectively reduces artifacts and background inconsistencies.

Abstract

High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audio. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate,…
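To make the mask-based input corruption described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `mask_mouth_region`, the fixed rectangular `box`, and the single-channel toy frames are all assumptions chosen for illustration, and the excerpt does not specify how the paper actually defines its mouth mask.

```python
def mask_mouth_region(frames, box):
    """Return a copy of `frames` with a rectangular region zeroed out.

    frames: list of frames, each a list of rows of float pixel values
            (single-channel, for simplicity; an assumption).
    box:    (top, bottom, left, right) bounds, bottom/right exclusive.
            A fixed box is hypothetical; a real pipeline would likely
            derive the region from facial landmarks.
    """
    top, bottom, left, right = box
    masked = []
    for frame in frames:
        masked.append([
            [0.0 if (top <= r < bottom and left <= c < right) else px
             for c, px in enumerate(row)]
            for r, row in enumerate(frame)
        ])
    return masked


# Toy clip: 4 frames of 8x8 pixels, all set to 1.0.
frames = [[[1.0] * 8 for _ in range(8)] for _ in range(4)]

# Hide the lower-middle block of each frame, standing in for the mouth.
masked = mask_mouth_region(frames, box=(4, 8, 2, 6))
```

In a mask-based setup, a model would receive `masked` together with the target audio and be trained to reconstruct the hidden pixels. Because the masked region carries no visual information, the surrounding spatiotemporal context is interrupted, which is the limitation the abstract attributes to this strategy.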

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis