Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
Aashish Chandra, Aashutosh A V, Abhijit Das

TL;DR
This paper introduces a multi-entangled latent space model that generates a synchronized talking-face video and matching speech from a static image, a voice profile, and a text prompt.
Contribution
The paper proposes a multi-entangled latent space framework that combines a voice profile, a driving image, and a text prompt, then feeds the shared features to separate audio and video decoders to synthesize a talking face with matching speech.
Findings
Successfully generates synchronized audio-visual talking faces.
Effectively encodes and combines multi-modal prompts in a shared latent space.
Produces realistic and coherent face animations from static images.
Abstract
We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt/driving text, the driving image, and the voice profile of an individual, combines them, and passes them to the multi-entangled latent space, which forms the key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space is responsible for establishing the spatiotemporal, person-specific features shared between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
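The abstract describes the latent space as supplying key-value pairs and queries to the audio and video pipelines. As a rough illustration of that idea only, the sketch below uses generic scaled dot-product cross-attention, which is an assumption here and not necessarily the mechanism the authors use. The embeddings, the query vectors, and the names `cross_attention`, `latent_tokens`, `audio_queries`, and `video_queries` are all hypothetical toy values chosen for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (assumed, not taken from the paper).

    Each query produces a weighted average of the value vectors, with weights
    given by the softmax of its scaled dot products with the keys.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        out.append(mixed)
    return out

# Hypothetical toy embeddings, one 4-d token per input.
# In a real system these would come from text, image, and voice encoders.
text_emb = [1.0, 0.0, 0.0, 0.0]
image_emb = [0.0, 1.0, 0.0, 0.0]
voice_emb = [0.0, 0.0, 1.0, 0.0]

# Shared token sequence standing in for the latent space; for simplicity the
# same tokens serve as both keys and values.
latent_tokens = [text_emb, image_emb, voice_emb]

# Hypothetical per-modality queries (e.g. one per output audio or video frame).
audio_queries = [[0.0, 0.0, 2.0, 0.0], [2.0, 0.0, 0.0, 0.0]]
video_queries = [[0.0, 2.0, 0.0, 0.0]]

# Each modality reads from the same shared tokens with its own queries;
# the results would then go to that modality's decoder.
audio_features = cross_attention(audio_queries, latent_tokens, latent_tokens)
video_features = cross_attention(video_queries, latent_tokens, latent_tokens)
```

Both modalities attend over the same shared tokens, so the audio and video features are derived from one common representation of the text, image, and voice inputs. In this toy setup the first audio query aligns most strongly with the voice token and the video query with the image token.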
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing
