Speech Driven Video Editing via an Audio-Conditioned Diffusion Model
Dan Bigioi, Shubhajit Basak, Michał Stypułkowski, Maciej Zięba, Hugh Jordan, Rachel McDonnell, Peter Corcoran

TL;DR
This paper introduces an end-to-end diffusion model for speech-driven video editing that synchronizes facial motions with audio without intermediate representations, demonstrating feasibility on multi-speaker videos.
Contribution
To the best of the authors' knowledge, it is the first work to apply end-to-end denoising diffusion models to audio-driven video editing, synchronizing facial motion directly from speech audio.
Findings
Successful lip and jaw motion synchronization without facial landmarks
Demonstrated on single and multi-speaker videos
Provides a baseline for future diffusion-based video editing methods
Abstract
Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person, and a separate auditory speech recording, the lip and jaw motions are re-synchronized without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof of concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven…
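The conditioning idea in the abstract can be illustrated with a minimal sketch. It assumes the standard DDPM forward noising process with a linear noise schedule, and it assumes that audio conditioning is done by concatenating mel features to the noisy frame. Neither choice is confirmed by this summary. The function names, the vector sizes, and the schedule values are hypothetical, and the random vectors stand in for a real face crop and real mel spectral features.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(frame, t, alpha_bars, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in frame]
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(frame, noise)]
    return noisy, noise


def build_denoiser_input(noisy_frame, mel_features, t, num_steps):
    """Hypothetical conditioning: noisy pixels + audio features + normalized timestep."""
    return list(noisy_frame) + list(mel_features) + [t / num_steps]


rng = random.Random(0)
num_steps = 1000
alpha_bars = cumulative_alphas(linear_beta_schedule(num_steps))

# Stand-ins for a flattened face crop and the mel features of its audio window.
frame = [rng.uniform(-1.0, 1.0) for _ in range(16)]
mel = [rng.uniform(-1.0, 1.0) for _ in range(8)]

t = 500
noisy, noise = add_noise(frame, t, alpha_bars, rng)
model_input = build_denoiser_input(noisy, mel, t, num_steps)
```

In a training loop, a denoising network would receive `model_input` and be trained to predict `noise`. Because the audio features are part of the input, the generated lip and jaw region can depend on the speech signal, with no landmark or 3D face model in between.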
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
Methods: Diffusion
