OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Ziqiao Peng; Jiwen Liu; Haoxian Zhang; Xiaoqiang Liu; Songlin Tang; Pengfei Wan; Di Zhang; Hongyan Liu; Jun He

arXiv:2505.21448·cs.CV·September 19, 2025

TL;DR

OmniSync introduces a universal, mask-free diffusion transformer framework for lip synchronization that maintains identity and pose consistency across diverse visual scenarios, outperforming prior methods.

Contribution

The paper presents OmniSync, a novel diffusion transformer-based approach with a mask-free training paradigm and adaptive guidance, enabling robust, high-quality lip sync in varied visual contexts.

Findings

1. Outperforms prior methods in visual quality and lip sync accuracy
2. Works effectively on both real-world and AI-generated videos
3. Establishes the first comprehensive AIGC-LipSync Benchmark

Abstract

Lip synchronization is the task of aligning a speaker's lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During…

Code & Models

Datasets

ZiqiaoPeng/AIGC_LipSync_Benchmark
dataset · 285 downloads

Taxonomy

Topics: Speech and Audio Processing