UniTalker: Conversational Speech-Visual Synthesis
Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li

TL;DR
UniTalker is a multimodal system that enhances conversational speech synthesis by generating emotionally expressive speech and synchronized talking-face animations, improving user interaction through audiovisual responses.
Contribution
The paper introduces a multimodal dialogue understanding and synthesis framework that integrates speech, text, and visual cues for more empathetic and natural conversational agents.
Findings
Produces more empathetic speech responses.
Generates natural talking-face animations.
Outperforms existing models in emotional consistency.
Abstract
Conversational Speech Synthesis (CSS) is a key task in user-agent interaction, aiming to generate more expressive and empathetic speech for users. However, it is well known that "listening" and "eye contact" play crucial roles in conveying emotions during real-world interpersonal communication. Existing CSS research is limited to perceiving only text and speech within the dialogue context, which restricts its effectiveness. Moreover, speech-only responses further constrain the interactive experience. To address these limitations, we introduce a Conversational Speech-Visual Synthesis (CSVS) task as an extension of traditional CSS. By leveraging multimodal dialogue context, it provides users with coherent audiovisual responses. To this end, we develop a CSVS system named UniTalker, which is a unified model that seamlessly integrates multimodal perception and multimodal rendering…
