LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision
Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, Weiwei Xing

TL;DR
This paper improves audio-driven lip-sync in portrait animation by integrating SyncNet supervision into latent diffusion models, addressing shortcut learning, and introducing StableSyncNet and TREPA for better convergence and temporal consistency, achieving state-of-the-art results.
Contribution
It presents a novel approach to enhance lip-sync accuracy by explicitly enforcing audio-visual correlation through SyncNet supervision and introduces StableSyncNet and TREPA mechanisms.
Findings
StableSyncNet raises SyncNet accuracy from 91% to 94% on the HDTF test set.
The proposed method surpasses state-of-the-art lip-sync methods on HDTF and VoxCeleb2 datasets.
TREPA improves temporal consistency in the generated videos.
Abstract
End-to-end audio-conditioned latent diffusion models (LDMs) have been widely adopted for audio-driven portrait animation, demonstrating their effectiveness in generating lifelike and high-resolution talking videos. However, direct application of audio-conditioned LDMs to lip-synchronization (lip-sync) tasks results in suboptimal lip-sync accuracy. Through an in-depth analysis, we identified the underlying cause as the "shortcut learning problem", wherein the model predominantly learns visual-visual shortcuts while neglecting the critical audio-visual correlations. To address this issue, we explored different approaches for integrating SyncNet supervision into audio-conditioned LDMs to explicitly enforce the learning of audio-visual correlations. Since the performance of SyncNet directly influences the lip-sync accuracy of the supervised model, the training of a well-converged SyncNet…
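The abstract describes SyncNet supervision only at a high level. As a rough illustration of the general idea, here is a minimal sketch of a SyncNet-style loss in the manner of earlier lip-sync work: it scores the cosine similarity between a visual embedding and an audio embedding, then penalizes low similarity with a binary cross-entropy term against the "in sync" label. The function names, the rescaling of similarity to a probability, and the embedding shapes are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_norm = np.linalg.norm(a, axis=1)
    b_norm = np.linalg.norm(b, axis=1)
    return np.sum(a * b, axis=1) / np.maximum(a_norm * b_norm, eps)

def sync_loss(visual_emb, audio_emb, eps=1e-7):
    """Hypothetical SyncNet-style loss: BCE against the 'in sync' label (1).

    Assumption: similarity in [-1, 1] is rescaled to a probability in (0, 1).
    The paper's actual formulation may differ.
    """
    sim = cosine_similarity(visual_emb, audio_emb)
    prob = np.clip((sim + 1.0) / 2.0, eps, 1.0 - eps)
    # With target = 1, binary cross-entropy reduces to -log(prob).
    return float(np.mean(-np.log(prob)))
```

In a training loop, a term like this would typically be added to the diffusion noise-prediction loss with a weighting coefficient, so that the generator is rewarded for frames whose lip region agrees with the audio rather than only with the reference frames.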
Code & Models
- 🤗 ndkhanh95/LatentSync (model) · ♡ 1
- 🤗 chunyu-li/LatentSync (model) · ♡ 50
- 🤗 ByteDance/LatentSync (model) · ♡ 50
- 🤗 Isi99999/LatentSync (model)
- 🤗 ByteDance/LatentSync-1.5 (model) · 5.0k dl · ♡ 87
- 🤗 ByteDance/LatentSync-1.6 (model) · 164k dl · ♡ 65
- 🤗 Kirfol/LatentSync (model) · ♡ 1
- 🤗 A8KC/LatentSync-1.5 (model) · 3 dl
Taxonomy
Topics: Speech and Audio Processing
Methods: ALIGN · Diffusion
