Multi Modal Adaptive Normalization for Audio to Video Generation
Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

TL;DR
This paper introduces a multi-modal adaptive normalization architecture for generating highly expressive talking-head videos from audio and a static image, improving realism and synchronization in speech-driven facial video synthesis.
Contribution
The paper proposes a novel multi-modal adaptive normalization method that effectively combines audio and visual features to synthesize realistic talking-head videos from a single image.
Findings
Outperforms existing methods like RSDGAN and Speech2Vid on multiple metrics
Achieves high levels of lip synchronization and facial expressiveness
Demonstrates superior qualitative results and user preference in Turing tests
Abstract
Speech-driven facial video generation is a complex problem because it spans two modalities, namely the audio and video domains. Audio carries many underlying features, such as expression, pitch, loudness and prosody (speaking style), while facial video varies widely in head movement, eye blinks, lip synchronization and the movement of facial action units, and must also remain temporally smooth. Synthesizing highly expressive facial videos from an audio input and a static image is still a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length, taking as input an audio signal and a single image of a person. The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor and class…
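The abstract names multi-modal adaptive normalization but is truncated before describing it. The sketch below only illustrates the general adaptive-normalization pattern, not the authors' formulation: each feature channel is normalized, then rescaled and shifted with parameters that depend on several modalities. The function names, the per-modality `(gamma, beta)` pairs and the `mix` weights are all hypothetical; in a trained model those values would be produced by learned layers over audio and image features rather than fixed numbers.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel (a flat list of activations) to ~zero mean, ~unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    return [(x - mean) / math.sqrt(var + eps) for x in channel]


def multimodal_adaptive_norm(features, modality_params, mix):
    """Illustrative adaptive normalization driven by several modalities.

    features        : list of channels, each a flat list of floats
    modality_params : one (gamma, beta) pair per modality; gamma and beta
                      are per-channel lists (hypothetical, fixed here)
    mix             : one non-negative weight per modality (hypothetical)
    """
    total = sum(mix)
    out = []
    for c, channel in enumerate(features):
        # Weighted average of the per-modality scale and shift for this channel.
        gamma = sum(w * g[c] for w, (g, _) in zip(mix, modality_params)) / total
        beta = sum(w * b[c] for w, (_, b) in zip(mix, modality_params)) / total
        out.append([gamma * x + beta for x in instance_norm(channel)])
    return out


if __name__ == "__main__":
    feature_map = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 12.0, 12.0]]
    audio_params = ([2.0, 1.0], [0.0, 1.0])   # assumed (gamma, beta) from audio
    image_params = ([0.0, 1.0], [2.0, 1.0])   # assumed (gamma, beta) from the image
    print(multimodal_adaptive_norm(feature_map, [audio_params, image_params], [1.0, 1.0]))
```

The point of the design is that the normalized activations carry no scale or offset of their own, so the modalities control the statistics of every channel through the mixed scale and shift.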
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
Methods: Heatmap
