UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

Ruidi Fan; Yang Zhou; Siyuan Wang; Tian Yu; Yutong Jiang; Xusheng Liu

arXiv:2603.03882·cs.CV·March 5, 2026

Open Access

TL;DR

UniSync is a lip synchronization framework that pairs mask-free training with mask-based inference to produce high-fidelity talking videos across diverse real-world conditions, including stylized avatars, face occlusion, and extreme lighting.

Contribution

The paper introduces UniSync, a unified framework that combines mask-free pose-anchored training with mask-based blending-consistent inference, along with domain adaptation and a new benchmark for real-world evaluation.

Findings

01. Outperforms state-of-the-art methods in diverse scenarios

02. Handles complex conditions such as stylized avatars and occlusions

03. Demonstrates high domain adaptability and realism

Abstract

Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to preserve head motion and eliminate synthesis color artifacts, while employing mask-based blending-consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on compact but diverse videos…
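To make the idea of mask-based blending concrete, the following is a minimal sketch of generic feathered-mask compositing, in which a generated region is pasted back into the original frame through a softened mask so that the seam fades gradually. It is an illustration only and is not UniSync's inference procedure, which the abstract does not specify; the function names `box_blur` and `blend`, the toy 5x5 frames, and the square "mouth" mask are all hypothetical.

```python
def box_blur(mask, radius):
    """Feather a 2D mask (list of rows of floats) with a simple box filter."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average over the window, clipped at the image border.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += mask[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def blend(original, generated, mask):
    """Per-pixel composite: mask=1 takes the generated pixel, mask=0 keeps the original."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(orow, grow, mrow)]
        for orow, grow, mrow in zip(original, generated, mask)
    ]


# Toy 5x5 grayscale frames (assumed data): original is all 0.0, generated is all 1.0.
original = [[0.0] * 5 for _ in range(5)]
generated = [[1.0] * 5 for _ in range(5)]

# Hypothetical binary "mouth" mask covering the central 3x3 block.
hard_mask = [
    [1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)]
    for y in range(5)
]

# Feathering the mask turns the hard seam into a gradual transition.
soft_mask = box_blur(hard_mask, radius=1)
result = blend(original, generated, soft_mask)
```

With the hard mask, pixels switch abruptly from original to generated at the mask boundary, which is where a visible seam or color step would appear. With the feathered mask, the center of the region is still fully generated while the border pixels are a weighted mix of both frames.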

Taxonomy

Topics: Speech and Audio Processing · Face Recognition and Analysis · Generative Adversarial Networks and Image Synthesis