StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
Dongchan Min, Minyoung Song, Eunji Ko, Sung Ju Hwang

TL;DR
StyleTalker is a model that generates realistic, lip-synced talking head videos from a single reference image and an audio clip, with head motion and lip movements that can be controlled independently.
Contribution
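It introduces three components for audio-driven talking head synthesis: a contrastive lip-sync discriminator, a conditional sequential variational autoencoder whose latent motion space is disentangled from lip movements, and an auto-regressive prior augmented with normalizing flow.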
Findings
Outperforms state-of-the-art baselines in perceptual quality
Achieves accurate lip synchronization with input audio
Enables independent control of head motion and lip movements
Abstract
We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflects the given audio. This is made possible with several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization, 2) a conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity, and 3) an auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space.…
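The abstract does not spell out the loss used by the contrastive lip-sync discriminator. As a rough illustration of the general idea only, the sketch below uses a generic InfoNCE-style contrastive loss, which is an assumption and not necessarily the authors' formulation. The names `cosine`, `contrastive_sync_loss`, `audio_emb`, `lip_embs`, and `temperature` are hypothetical. The loss is small when the audio embedding is closest to the in-sync lip embedding and large when an out-of-sync candidate matches better.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_sync_loss(audio_emb, lip_embs, positive_index, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's exact loss).

    audio_emb:      embedding of one audio segment
    lip_embs:       candidate lip embeddings, one in sync and the rest out of sync
    positive_index: index of the in-sync candidate in lip_embs
    """
    # Temperature-scaled similarity between the audio and every lip candidate.
    logits = [cosine(audio_emb, lip) / temperature for lip in lip_embs]
    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denom = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability of picking the in-sync candidate.
    return log_denom - logits[positive_index]


if __name__ == "__main__":
    audio = [1.0, 0.0]
    lips = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print("in-sync candidate marked positive:", contrastive_sync_loss(audio, lips, 0))
    print("out-of-sync candidate marked positive:", contrastive_sync_loss(audio, lips, 2))
```

In a real system the embeddings would come from learned audio and lip-region encoders, and the negatives would typically be lip frames taken from other time offsets or other clips. Both of those choices are assumptions here as well.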
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
