Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Dogucan Yaman; Seymanur Akti; Fevziye Irem Eyiokur; Alexander Waibel

arXiv:2511.05432·cs.CV·November 10, 2025

TL;DR

This paper introduces a novel text-to-talking-face synthesis method that uses shared latent speech representations to generate synchronized speech and facial animations, improving realism and lip-sync accuracy.

Contribution

It presents a joint text-to-audio-visual synthesis framework leveraging latent speech features and a two-stage training process to enhance alignment and realism.

Findings

01 Outperforms cascaded pipelines in lip-sync accuracy

02 Produces more natural and expressive speech and facial animations

03 Maintains speaker identity without ground-truth audio at inference

Abstract

We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training procedure: pretraining on Wav2Vec2 embeddings, then finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
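The data flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: every name below (`toy_text_to_vec`, `toy_speech_decoder`, `toy_face_generator`, `synthesize`, `conditioning_for_stage`) is hypothetical, and the "models" are trivial stand-ins that only mimic shapes. The sketch shows two ideas: one latent sequence is predicted from text and reused for both outputs, and the conditioning source switches between the two training stages.

```python
from dataclasses import dataclass
from typing import Callable, List

Latent = List[List[float]]  # one feature vector per latent frame


@dataclass
class JointOutput:
    speech: List[float]             # toy waveform samples
    face_frames: List[List[float]]  # toy per-frame mouth parameters


def toy_text_to_vec(text: str) -> Latent:
    """Hypothetical stand-in for Text-to-Vec: one 2-d vector per character."""
    return [[ord(ch) / 255.0, float(ch in "aeiou")] for ch in text]


def toy_speech_decoder(latent: Latent, samples_per_frame: int = 4) -> List[float]:
    """Hypothetical stand-in for the speech decoder: a fixed number of samples per frame."""
    return [vec[0] for vec in latent for _ in range(samples_per_frame)]


def toy_face_generator(latent: Latent) -> List[List[float]]:
    """Hypothetical stand-in for the face generator: one video frame per latent frame."""
    return [[vec[1]] for vec in latent]


def synthesize(
    text: str,
    text_to_vec: Callable[[str], Latent] = toy_text_to_vec,
    speech_decoder: Callable[[Latent], List[float]] = toy_speech_decoder,
    face_generator: Callable[[Latent], List[List[float]]] = toy_face_generator,
) -> JointOutput:
    # The latent is predicted once and shared, so both outputs follow the
    # same frame timeline; no ground-truth audio is needed at this point.
    latent = text_to_vec(text)
    return JointOutput(speech=speech_decoder(latent), face_frames=face_generator(latent))


def conditioning_for_stage(stage: str, audio_latent: Latent, predicted_latent: Latent) -> Latent:
    """Pick the face generator's conditioning features for a training stage.

    Assumed reading of the abstract: pretraining uses embeddings extracted
    from real audio, finetuning uses the text-predicted embeddings.
    """
    if stage == "pretrain":
        return audio_latent
    if stage == "finetune":
        return predicted_latent
    raise ValueError(f"unknown stage: {stage!r}")


out = synthesize("hello")
print(len(out.face_frames), len(out.speech))
```

In this toy setup the audio and video lengths are tied together by construction, because both are derived from the same latent sequence. A cascaded pipeline would instead synthesize audio first and then extract features from that audio for the face model.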

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing