Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
Yifan Hu, Rui Liu, Yi Ren, Xiang Yin, Haizhou Li

TL;DR
Chain-Talker introduces a three-stage framework for empathetic conversational speech synthesis that improves emotional expressiveness and interpretability by mimicking human cognition and utilizing an LLM-driven emotion captioning pipeline.
Contribution
The paper presents a three-stage framework for Conversational Speech Synthesis (CSS) that improves emotional perception and interpretability, supported by an LLM-based emotion captioning pipeline.
Findings
Outperforms existing methods in producing expressive, empathetic speech
Uses CSS-EmCap, an LLM-driven pipeline that generates conversational speech emotion captions, to support emotion modeling
Demonstrates effectiveness on three benchmark datasets
Abstract
Conversational Speech Synthesis (CSS) aims to align synthesized speech with the emotional and stylistic context of user-agent interactions to achieve empathy. Current generative CSS models face interpretability limitations due to insufficient emotional perception and redundant discrete speech coding. To address the above issues, we present Chain-Talker, a three-stage framework mimicking human cognition: Emotion Understanding derives context-aware emotion descriptors from dialogue history; Semantic Understanding generates compact semantic codes via serialized prediction; and Empathetic Rendering synthesizes expressive speech by integrating both components. To support emotion modeling, we develop CSS-EmCap, an LLM-driven automated pipeline for generating precise conversational speech emotion captions. Experiments on three benchmark datasets demonstrate that Chain-Talker produces more…
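The three-stage chain described in the abstract can be pictured as a simple data flow. The following is only a minimal sketch of that flow, not the authors' implementation: every name in it (`Turn`, `understand_emotion`, `understand_semantics`, `render_speech`, `chain_talker_sketch`) is hypothetical, the keyword rule and the character-sum "codes" are toy stand-ins for learned models, and the codebook size of 1024 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Turn:
    """One utterance in the dialogue history (hypothetical container)."""
    speaker: str
    text: str


def understand_emotion(history: List[Turn]) -> str:
    """Stage 1 stand-in: map the dialogue history to an emotion descriptor.

    A toy keyword rule replaces the learned model; it only looks at the
    most recent turn.
    """
    last = history[-1].text.lower() if history else ""
    if any(word in last for word in ("sorry", "sad", "lost")):
        return "gentle, consoling tone"
    return "neutral, friendly tone"


def understand_semantics(text: str, emotion: str) -> List[int]:
    """Stage 2 stand-in: emit one integer code per word, one at a time.

    The sequential loop loosely mirrors the idea of serialized prediction.
    The emotion descriptor is accepted as conditioning input but unused in
    this toy version. The modulus 1024 is an assumed codebook size.
    """
    codes: List[int] = []
    for word in text.split():
        codes.append(sum(ord(ch) for ch in word) % 1024)
    return codes


def render_speech(emotion: str, codes: List[int]) -> Dict[str, Union[str, int]]:
    """Stage 3 stand-in: combine both inputs into a description of the output.

    A real renderer would produce audio; this returns metadata only.
    """
    return {"style": emotion, "num_frames": len(codes)}


def chain_talker_sketch(history: List[Turn], reply_text: str) -> Dict[str, Union[str, int]]:
    """Run the three stages in order: emotion, then semantics, then rendering."""
    emotion = understand_emotion(history)
    codes = understand_semantics(reply_text, emotion)
    return render_speech(emotion, codes)


if __name__ == "__main__":
    history = [Turn("user", "I lost my keys today")]
    print(chain_talker_sketch(history, "I am here to help"))
```

The point of the sketch is the ordering: the emotion descriptor is computed first from context and then passed forward, so the rendering stage receives both an explicit, human-readable emotion description and the semantic codes, which is where the interpretability claim in the abstract comes from.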
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Topic Modeling
