UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios
Ruidi Fan, Yang Zhou, Siyuan Wang, Tian Yu, Yutong Jiang, Xusheng Liu

TL;DR
UniSync is a lip synchronization framework that combines mask-free training with mask-based inference to produce high-fidelity, realistic talking videos across diverse real-world scenarios and challenging conditions.
Contribution
The paper introduces UniSync, a unified approach that pairs mask-free pose-anchored training with mask-based blending-consistent inference, and complements it with domain adaptation and a new benchmark for real-world evaluation.
Findings
Outperforms state-of-the-art methods in diverse scenarios
Handles complex conditions like stylized avatars and occlusions
Demonstrates high domain adaptability and realism
Abstract
Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to preserve head motion and eliminate synthesis color artifacts, while employing mask-based, blending-consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on compact but diverse videos…
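As a rough illustration of what mask-based blending at inference time can look like in general, the sketch below composites a generated frame into the original frame through a feathered (soft-edged) mask. This is generic alpha compositing, not UniSync's actual procedure, which the abstract does not specify. The function names `soft_box_mask` and `blend`, the rectangular mouth region, and the linear feathering are all assumptions made for the example.

```python
import numpy as np


def soft_box_mask(height, width, box, feather):
    """Hypothetical helper: mask that is 1 inside `box` and ramps
    linearly to 0 over `feather` pixels outside it.

    `box` is (top, bottom, left, right) with exclusive bottom/right.
    """
    top, bottom, left, right = box
    ys = np.arange(height)[:, None]
    xs = np.arange(width)[None, :]
    # Per-axis distance in pixels outside the box; 0 for pixels inside it.
    dy = np.maximum(np.maximum(top - ys, ys - (bottom - 1)), 0)
    dx = np.maximum(np.maximum(left - xs, xs - (right - 1)), 0)
    dist = np.maximum(dy, dx)  # Chebyshev distance to the box
    return np.clip(1.0 - dist / feather, 0.0, 1.0)


def blend(original, generated, mask):
    """Alpha-composite `generated` over `original` using `mask` in [0, 1].

    Both frames are float arrays of shape (H, W, C); mask has shape (H, W).
    """
    m = mask[..., None]  # broadcast the mask over the channel axis
    return m * generated + (1.0 - m) * original


# Toy frames: a black original and a white "generated" frame.
original = np.zeros((8, 8, 3))
generated = np.ones((8, 8, 3))
mask = soft_box_mask(8, 8, box=(2, 6, 2, 6), feather=2)
output = blend(original, generated, mask)
```

Pixels inside the box come entirely from the generated frame, pixels far from it come entirely from the original, and the feathered band between them mixes the two, which is the usual way to avoid a visible seam at the mask boundary.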
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
