Multi Modal Adaptive Normalization for Audio to Video Generation

Neeraj Kumar; Srishti Goel; Ankur Narang; Brejesh Lall

arXiv:2012.07304·cs.CV·December 15, 2020

TL;DR

This paper introduces a multi-modal adaptive normalization architecture for generating highly expressive talking-head videos from audio and a static image, improving realism and synchronization in speech-driven facial video synthesis.

Contribution

The paper proposes a novel multi-modal adaptive normalization method that effectively combines audio and visual features to synthesize realistic talking-head videos from a single image.

Findings

01. Outperforms existing methods like RSDGAN and Speech2Vid on multiple metrics

02. Achieves high levels of lip synchronization and facial expressiveness

03. Demonstrates superior qualitative results and user preference in Turing tests

Abstract

Speech-driven facial video generation is a complex problem because it is multi-modal, spanning both the audio and video domains. The audio carries many underlying features, such as expression, pitch, loudness, and prosody (speaking style), while facial video varies widely in head movement, eye blinks, lip synchronization, and the movements of various facial action units, all of which must remain temporally smooth. Synthesizing highly expressive facial videos from an audio input and a static image is still a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length from two inputs: an audio signal and a single image of a person. The architecture uses the multi-modal adaptive normalization, keypoint heatmap predictor, optical flow predictor and class…
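
The abstract is cut off above and does not say how MAN computes its normalization parameters. As a rough illustration only, the general idea of adaptive normalization conditioned on several modalities can be sketched as below: activations are normalized, then rescaled and shifted by parameters derived from audio and image features. Every name here (`normalize`, `affine_params`, `multimodal_adaptive_norm`, the per-modality weights, and the mixing coefficients `mix`) is a hypothetical assumption made for this sketch, not the authors' implementation.

```python
from statistics import fmean


def normalize(channel, eps=1e-5):
    """Normalize one channel's activations to zero mean and (near) unit variance."""
    mu = fmean(channel)
    var = fmean((x - mu) ** 2 for x in channel)
    return [(x - mu) / (var + eps) ** 0.5 for x in channel]


def affine_params(features, weights):
    """Hypothetical stand-in for a small learned network.

    Maps one modality's feature vector to a (gamma, beta) pair with two
    dot products. A real model would use trained layers here.
    """
    gamma = sum(f * w for f, w in zip(features, weights["gamma"]))
    beta = sum(f * w for f, w in zip(features, weights["beta"]))
    return gamma, beta


def multimodal_adaptive_norm(channel, modalities, mix):
    """Normalize `channel`, then modulate it with parameters from every modality.

    `modalities` is a list of (features, weights) pairs, for example one for
    audio and one for the input image. `mix` holds one coefficient per
    modality (assumed fixed here; it could equally be learned).
    """
    x_hat = normalize(channel)
    gamma, beta = 1.0, 0.0  # identity modulation when no modality contributes
    for (features, weights), rho in zip(modalities, mix):
        g, b = affine_params(features, weights)
        gamma += rho * g
        beta += rho * b
    return [gamma * x + beta for x in x_hat]


# Toy usage with made-up numbers: one audio and one image feature vector.
audio = ([0.5, -0.5], {"gamma": [0.2, 0.1], "beta": [0.4, 0.0]})
image = ([1.0], {"gamma": [0.3], "beta": [-0.1]})
out = multimodal_adaptive_norm([1.0, 2.0, 3.0, 4.0], [audio, image], mix=(0.5, 0.5))
```

The design choice illustrated is that the conditioning signals never get concatenated with the activations. They only set the scale and shift applied after normalization, which is the common pattern in adaptive normalization layers.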

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing

Methods: Heatmap