Audio Input Generates Continuous Frames to Synthesize Facial Video Using Generative Adversarial Networks
Hanhaodi Zhang

TL;DR
This paper introduces a GAN-based method that uses short audio segments to generate continuous, realistic facial videos with temporal coherence, leveraging convolutional GRUs for improved performance.
Contribution
It proposes a novel GAN architecture with convolutional GRUs conditioned on short audio inputs for realistic speech video synthesis.
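For orientation, the update equations of a convolutional GRU are commonly written as below. This is the textbook formulation, not necessarily the exact variant used in the paper; the symbols ($x_t$ for the input feature map, $h_t$ for the hidden state, $W$ and $U$ for convolution kernels) are notation assumed here, with $*$ denoting convolution and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W_h * x_t + U_h * (r_t \odot h_{t-1})\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Because the hidden state $h_{t-1}$ carries information from earlier frames into the current one, a recurrence of this kind is a natural way to encourage temporal coherence across generated frames.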
Findings
GRU improves temporal coherence of generated frames
Short audio segments suffice for realistic video synthesis
Model achieves relatively realistic facial videos from audio
Abstract
This paper presents a simple method for generating speech videos from audio: given a piece of audio, we generate a video of the target face speaking it. We propose a Generative Adversarial Network (GAN) conditioned on short segments of speech audio, with Convolutional Gated Recurrent Units (GRUs) in both the generator and the discriminator. The model is trained on short audio segments paired with the video frames from the same time span: we cut the audio and extract the face from the corresponding frames. We design a simple encoder and compare the frames generated by the GAN with and without the GRU. The GRU is used to make the generated frames temporally coherent, and the results show that short audio segments can produce relatively realistic output.
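The audio-cutting step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `frame_audio_windows`, the 0.2 s window length, and the choice to centre each window on its frame are all assumptions, since the abstract does not specify them.

```python
def frame_audio_windows(num_samples, sample_rate, fps, window_s=0.2):
    """Return one (start, end) audio sample range per video frame.

    Hypothetical helper: each window is centred on the midpoint of its
    frame and clipped to the bounds of the recording. The 0.2 s default
    is an assumed value, not taken from the paper.
    """
    half = int(round(window_s * sample_rate / 2))
    # Number of whole video frames covered by the audio (integer arithmetic).
    num_frames = num_samples * fps // sample_rate
    windows = []
    for i in range(num_frames):
        # Sample index at the temporal midpoint of frame i.
        centre = int(round((i + 0.5) / fps * sample_rate))
        start = max(0, centre - half)
        end = min(num_samples, centre + half)
        windows.append((start, end))
    return windows


# One second of 16 kHz audio paired with 25 fps video.
windows = frame_audio_windows(16000, 16000, 25)
print(len(windows), windows[12])
```

Each resulting sample range would be paired with the face crop from the matching frame to form one training example; windows near the start and end of the recording are shorter because they are clipped.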
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
Methods: Gated Recurrent Unit
