Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

Dan Bigioi; Shubhajit Basak; Michał Stypułkowski; Maciej Zięba; Hugh Jordan; Rachel McDonnell; Peter Corcoran

arXiv:2301.04474 · cs.CV · May 12, 2023 · 1 citation

Open Access

TL;DR

This paper introduces an end-to-end diffusion model for speech-driven video editing that synchronizes facial motions with audio without intermediate representations, demonstrating feasibility on multi-speaker videos.

Contribution

To the authors' knowledge, it is the first work to apply end-to-end denoising diffusion models to audio-driven video editing, synchronizing facial motion directly from speech audio.

Findings

01 Lip and jaw motions are re-synchronized to the speech audio without facial landmarks or a 3D face model

02 Demonstrated on both single-speaker and multi-speaker video editing

03 Provides a baseline model on the CREMA-D audiovisual data set for future diffusion-based video editing methods

Abstract

Taking inspiration from recent developments in visual generative tasks using diffusion models, we propose a method for end-to-end speech-driven video editing using a denoising diffusion model. Given a video of a talking person, and a separate auditory speech recording, the lip and jaw motions are re-synchronized without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio mel spectral features to generate synchronised facial motion. Proof of concept results are demonstrated on both single-speaker and multi-speaker video editing, providing a baseline model on the CREMA-D audiovisual data set. To the best of our knowledge, this is the first work to demonstrate and validate the feasibility of applying end-to-end denoising diffusion models to the task of audio-driven…
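
The abstract says only that a denoising diffusion model is conditioned on audio mel spectral features. As a rough illustration of what that can mean in practice, the sketch below shows the standard forward (noising) process used by denoising diffusion models and one hypothetical way to attach audio features to the denoiser input. The linear noise schedule, its values, the function names, and the concatenation-based conditioning are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (values are assumptions)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(frame, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) for a flattened frame; returns (x_t, noise)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in frame]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(frame, noise)]
    return noisy, noise


def denoiser_input(noisy_frame, audio_features):
    """Hypothetical conditioning: append audio features to the noisy pixels."""
    return list(noisy_frame) + list(audio_features)


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    total = sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise))
    return total / len(true_noise)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
    frame = [0.5, -0.25, 0.0, 1.0]   # toy stand-in for a flattened video frame
    mel = [0.1, 0.2, 0.3]            # toy stand-in for mel spectral features
    x_t, eps = add_noise(frame, 500, alpha_bars, rng)
    print(len(denoiser_input(x_t, mel)))
```

In a trained system, a neural network would take the conditioned input and predict the noise, with `noise_prediction_loss` as the training signal. The toy lists here stand in for real image tensors and mel spectrogram windows.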

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing

Methods: Diffusion