SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion

Junxian Ma; Shiwen Wang; Jian Yang; Junyi Hu; Jian Liang; Guosheng Lin; Jingbo Chen; Kai Li; Yu Meng

arXiv:2502.11515 · cs.CV · February 18, 2025

TL;DR

SayAnything is a novel diffusion-based framework that directly synthesizes realistic lip movements from audio input, maintaining speaker identity and enabling precise control without complex training pipelines.

Contribution

It introduces a conditional video diffusion model with specialized modules for identity preservation, audio guidance, and editing control, simplifying training and improving naturalness.

Findings

01. Generates highly realistic lip-synced videos with improved coherence.
02. Effectively generalizes to unseen characters and animated figures.
03. Balances multiple conditions in latent space for precise control.

Abstract

Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules including identity preservation module, audio guidance module, and editing control module. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate…
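The abstract mentions region-specific generation but does not say how the editing control module restricts changes to a region. One common way to do this in inpainting-style pipelines is mask-based blending, where generated values replace the reference only inside a mask. The following is a minimal sketch of that general idea, not the paper's confirmed method; the function `blend_region`, the `Grid` type, and the toy 2×2 values are all hypothetical and chosen for illustration.

```python
from typing import List

# Hypothetical stand-in for a single-channel latent or image plane.
Grid = List[List[float]]


def blend_region(generated: Grid, reference: Grid, mask: Grid) -> Grid:
    """Blend two grids element-wise.

    Where mask is 1 the generated value is used, where mask is 0 the
    reference value is kept; fractional mask values mix the two linearly.
    This is generic mask compositing, assumed here for illustration only.
    """
    if not (len(generated) == len(reference) == len(mask)):
        raise ValueError("all grids must have the same number of rows")
    blended: Grid = []
    for g_row, r_row, m_row in zip(generated, reference, mask):
        if not (len(g_row) == len(r_row) == len(m_row)):
            raise ValueError("all rows must have the same number of columns")
        blended.append(
            [m * g + (1.0 - m) * r for g, r, m in zip(g_row, r_row, m_row)]
        )
    return blended


# Toy example: the bottom row plays the role of a mouth region.
reference = [[0.0, 0.0], [0.0, 0.0]]
generated = [[1.0, 1.0], [1.0, 1.0]]
mouth_mask = [[0.0, 0.0], [1.0, 1.0]]
blended = blend_region(generated, reference, mouth_mask)
```

In this toy case the top row keeps the reference values and the bottom row takes the generated ones, which is the behaviour one would want if only the mouth area should change while the rest of the face stays fixed.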

Taxonomy

Topics: Speech and Audio Processing · Music Technology and Sound Studies · Music and Audio Processing

Methods: Diffusion