UniTalker: Conversational Speech-Visual Synthesis

Yifan Hu; Rui Liu; Yi Ren; Xiang Yin; Haizhou Li

arXiv:2508.04585 · eess.AS · August 8, 2025 · ACM Multimedia

TL;DR

UniTalker is a multimodal system that enhances conversational speech synthesis by generating emotionally expressive speech and synchronized talking-face animations, improving user interaction through audiovisual responses.

Contribution

The paper introduces a multimodal dialogue understanding and synthesis framework that integrates speech, text, and visual cues to make conversational agents more empathetic and natural.

Findings

01. Produces more empathetic speech responses.

02. Generates natural talking-face animations.

03. Outperforms existing models in emotional consistency.

Abstract

Conversational Speech Synthesis (CSS) is a key task in user-agent interaction, aiming to generate more expressive and empathetic speech for users. However, it is well known that "listening" and "eye contact" play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering…
