ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation
Zhuoyue Gao, Xiaohui Wang, Xiaocui Yang, Wen Zhang, Daling Wang, Shi Feng, Yifei Zhang

TL;DR
ES4R is a speech encoding framework that explicitly models structured affective context before speech encoding, with the goal of more empathetic and coherent responses in multi-turn spoken dialogue; it is reported to outperform baseline models.
Contribution
Introduces a dual-level attention mechanism that captures turn-level affective states and dialogue-level affective dynamics, and integrates the resulting affective representations with speech and text for empathetic response generation.
Findings
Outperforms baseline models in automatic evaluations.
Achieves higher human-rated empathy and coherence.
Robust across different language model backbones.
Abstract
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, which often weakens affective information and contextual coherence in multi-turn dialogues. To address this, we propose ES4R, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations…
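The abstract does not give the details of the dual-level attention mechanism, so the following is only a minimal sketch of the general idea under stated assumptions, not the ES4R architecture. It assumes each turn is a list of frame-level feature vectors, uses scaled dot-product attention pooling at both levels, and replaces learned parameters with fixed, hypothetical query vectors (`turn_query`, `dialogue_query`). All function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(vectors, query):
    """Scaled dot-product attention pooling.

    Each vector is weighted by its similarity to `query`; the result is
    the weighted average of the vectors plus the attention weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(v, query) / scale for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return pooled, weights


def dual_level_attention(dialogue, turn_query, dialogue_query):
    """Hypothetical two-level pooling (an assumption, not the paper's design).

    Level 1 (turn level): pool the frames of each turn into one vector,
    standing in for a turn-level affective state.
    Level 2 (dialogue level): attend across the turn vectors to obtain a
    single context vector, standing in for dialogue-level dynamics.
    """
    turn_states = [attention_pool(frames, turn_query)[0] for frames in dialogue]
    context, turn_weights = attention_pool(turn_states, dialogue_query)
    return turn_states, context, turn_weights


# Toy dialogue: 2 turns with 2 and 3 frames, each frame a 2-dim feature vector.
dialogue = [
    [[0.1, 0.9], [0.2, 0.8]],
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]],
]
turn_states, context, turn_weights = dual_level_attention(
    dialogue, turn_query=[1.0, 0.0], dialogue_query=[0.0, 1.0]
)
```

In a trained model the queries would be learned, and the pooled vectors would be combined with the speech and text inputs; how ES4R does this is not recoverable from the truncated abstract.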
Taxonomy
Topics: Topic Modeling · Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining
