TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation

Ji-Hoon Kim; Junseok Ahn; Doyeop Kwak; Joon Son Chung; Shinji Watanabe

arXiv:2512.20296 · cs.CV · December 24, 2025

Open Access

TL;DR

This paper introduces TAVID, a unified framework that jointly generates synchronized interactive faces and speech from text and images, addressing the multimodal nature of human conversation.

Contribution

TAVID is presented as the first system to integrate face and speech generation, using two bidirectional cross-modal mappers (a motion mapper and a speaker mapper) to produce synchronized audio-visual dialogue.

Findings

01 TAVID generates realistic talking faces.

02 It produces responsive listening-head behaviors.

03 It produces high-quality conversational speech.

Abstract

The objective of this paper is to jointly synthesize interactive videos and conversational speech from text and reference images. With the ultimate goal of building human-like conversational systems, recent studies have explored talking or listening head generation as well as conversational speech generation. However, these works are typically studied in isolation, overlooking the multimodal nature of human conversation, which involves tightly coupled audio-visual interactions. In this paper, we introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. TAVID integrates face and speech generation pipelines through two cross-modal mappers (i.e., a motion mapper and a speaker mapper), which enable bidirectional exchange of complementary information between the audio and visual modalities. We evaluate our system across…

Taxonomy

Topics: Face recognition and analysis · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis