StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Dongchan Min; Minyoung Song; Eunji Ko; Sung Ju Hwang

arXiv:2208.10922 · cs.CV · March 18, 2024 · 6 cites

Open Access

TL;DR

StyleTalker generates realistic, lip-synced talking head videos from a single reference image and an audio clip, and allows head motion to be controlled independently of lip movements.

Contribution

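It introduces three components for audio-driven talking head synthesis: a contrastive lip-sync discriminator, a conditional sequential variational autoencoder whose latent motion space is disentangled from lip movements, and an auto-regressive prior augmented with normalizing flow.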

Findings

01

Outperforms state-of-the-art baselines in perceptual quality

02

Achieves accurate lip synchronization with input audio

03

Enables independent control of head motion and lip movements

Abstract

We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible with several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization, 2) a conditional sequential variational autoencoder that learns the latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity, and 3) an auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space.…
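To make the idea of a contrastive lip-sync objective more concrete, here is a minimal NumPy sketch. The abstract does not state the exact loss, so everything below is an assumption for illustration: the InfoNCE-style formulation, the function name `contrastive_sync_loss`, the `temperature` parameter, and the premise that audio and lip-region features have already been embedded into a shared space. Each audio embedding is scored against every lip embedding in the batch, and the loss rewards the temporally matching pair over the mismatched ones.

```python
import numpy as np


def contrastive_sync_loss(audio_emb, lip_emb, temperature=0.1):
    """InfoNCE-style loss over a batch of (audio, lip) embedding pairs.

    Hypothetical sketch, not the paper's implementation. Row i of
    audio_emb and row i of lip_emb are assumed to come from the same
    time window (the positive pair); all other rows act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = lip_emb / np.linalg.norm(lip_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (N, N), scaled by temperature.
    logits = (a @ v.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (matching pairs) as targets.
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    # Perfectly aligned pairs should score better than shifted ones.
    print("aligned:", contrastive_sync_loss(audio, audio))
    print("shifted:", contrastive_sync_loss(audio, np.roll(audio, 1, axis=0)))
```

In this toy setup the aligned batch yields a lower loss than the shifted batch, which is the behavior a sync discriminator would be trained to exploit. How such a signal is combined with the generator's other objectives is not covered by the abstract.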

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis