LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision

Chunyu Li; Chao Zhang; Weikai Xu; Jingyu Lin; Jinghui Xie; Weiguo Feng; Bingyue Peng; Cunjian Chen; Weiwei Xing

arXiv:2412.09262 · cs.CV · March 14, 2025 · 3 cites

TL;DR

This paper improves lip-sync accuracy in audio-conditioned latent diffusion models by adding SyncNet supervision to counter shortcut learning. It introduces StableSyncNet for more stable SyncNet convergence and TREPA for better temporal consistency, and reports state-of-the-art results.

Contribution

The paper explicitly enforces audio-visual correlation in audio-conditioned latent diffusion models through SyncNet supervision, and introduces two supporting components, StableSyncNet and TREPA.

Findings

01. StableSyncNet raises SyncNet accuracy from 91% to 94%.

02. The proposed method surpasses state-of-the-art lip-sync methods on the HDTF and VoxCeleb2 datasets.

03. TREPA improves temporal consistency in the generated videos.

Abstract

End-to-end audio-conditioned latent diffusion models (LDMs) have been widely adopted for audio-driven portrait animation, demonstrating their effectiveness in generating lifelike and high-resolution talking videos. However, direct application of audio-conditioned LDMs to lip-synchronization (lip-sync) tasks results in suboptimal lip-sync accuracy. Through an in-depth analysis, we identified the underlying cause as the "shortcut learning problem", wherein the model predominantly learns visual-visual shortcuts while neglecting the critical audio-visual correlations. To address this issue, we explored different approaches for integrating SyncNet supervision into audio-conditioned LDMs to explicitly enforce the learning of audio-visual correlations. Since the performance of SyncNet directly influences the lip-sync accuracy of the supervised model, the training of a well-converged SyncNet…
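The idea of adding SyncNet supervision to a denoising objective can be illustrated with a small sketch. This is not the authors' implementation: it assumes a pretrained SyncNet that yields one visual and one audio embedding per clip, uses a cosine-similarity score with a binary cross-entropy penalty (a common choice for SyncNet-style losses), and the names `sync_loss`, `total_loss`, and `sync_weight` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def sync_loss(visual_emb, audio_emb, in_sync=True):
    """Binary cross-entropy on a similarity-derived sync probability.

    Assumption: cosine similarity in [-1, 1] is mapped linearly to a
    probability in (0, 1); other mappings are possible.
    """
    sim = cosine_similarity(visual_emb, audio_emb)
    # Clamp to keep the logarithms finite.
    p = min(max((sim + 1.0) / 2.0, 1e-7), 1.0 - 1e-7)
    target = 1.0 if in_sync else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def total_loss(denoise_loss, visual_emb, audio_emb, sync_weight=0.05):
    """Denoising loss plus a weighted sync term.

    sync_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return denoise_loss + sync_weight * sync_loss(visual_emb, audio_emb)


if __name__ == "__main__":
    aligned = sync_loss([1.0, 0.0], [1.0, 0.0])
    misaligned = sync_loss([1.0, 0.0], [-1.0, 0.0])
    print(aligned < misaligned)  # the aligned pair is penalised less
```

The sync term only depends on how well the generated lip region agrees with the audio, so a model that ignores the audio and copies visual context (the visual-visual shortcut described above) would still be penalised by it.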

Code & Models

Repositories

bytedance/LatentSync (PyTorch, official)


Taxonomy

Topics: Speech and Audio Processing

Methods: ALIGN · Diffusion