Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel

TL;DR
This paper introduces a novel text-to-talking-face synthesis method that uses shared latent speech representations to generate synchronized speech and facial animations, improving realism and lip-sync accuracy.
Contribution
It presents a joint text-to-audio-visual synthesis framework leveraging latent speech features and a two-stage training process to enhance alignment and realism.
Findings
Outperforms cascaded pipelines in lip-sync accuracy
Produces more natural and expressive speech and facial animations
Maintains speaker identity without ground-truth audio at inference
Abstract
We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFace recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
