ES4R: Speech Encoding Based on Prepositive Affective Modeling for Empathetic Response Generation

Zhuoyue Gao; Xiaohui Wang; Xiaocui Yang; Wen Zhang; Daling Wang; Shi Feng; Yifei Zhang

arXiv:2601.16225 · eess.AS · January 26, 2026

Open Access

TL;DR

ES4R is a speech encoding framework that explicitly models structured affective context before speech encoding, so that dialogue systems generate more empathetic and coherent responses. It outperforms baseline models in both automatic and human evaluations.

Contribution

ES4R introduces a dual-level attention mechanism that captures turn-level affective states and dialogue-level affective dynamics, and integrates this structured affective modeling with speech and text for empathetic response generation.

Findings

01 Outperforms baseline models in automatic evaluations.

02 Achieves higher human-rated empathy and coherence.

03 Robust across different language model backbones.

Abstract

Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transcription or use encoders to extract latent representations, often weakening affective information and contextual coherence in multi-turn dialogues. To address this, we propose ES4R, a framework for speech-based empathetic response generation. Our core innovation lies in explicitly modeling structured affective context before speech encoding, rather than relying on implicit learning by the encoder or explicit emotion supervision. Specifically, we introduce a dual-level attention mechanism to capture turn-level affective states and dialogue-level affective dynamics. The resulting affective representations…
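
The dual-level idea in the abstract can be illustrated with a minimal sketch. This is not the ES4R architecture, whose details are not given on this page. It assumes that the turn level is an attention pooling over the frame features of one turn, and that the dialogue level is a causal self-attention over the resulting turn vectors. All names (`turn_level_pool`, `dialogue_level_attend`, `query`) and the toy feature values are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def turn_level_pool(frames, query):
    # Assumption: one turn is a list of frame vectors; a (hypothetical)
    # query vector scores each frame, and the turn vector is their
    # attention-weighted average.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frames])
    return weighted_sum(weights, frames)

def dialogue_level_attend(turn_vectors):
    # Assumption: causal scaled dot-product self-attention, so turn t
    # only attends to turns 0..t (its own history).
    scale = math.sqrt(len(turn_vectors[0]))
    out = []
    for t, q in enumerate(turn_vectors):
        history = turn_vectors[: t + 1]
        weights = softmax([dot(q, k) / scale for k in history])
        out.append(weighted_sum(weights, history))
    return out

def dual_level_affective_context(dialogue, query):
    # Turn level first, then dialogue level over the pooled turn vectors.
    turn_vectors = [turn_level_pool(frames, query) for frames in dialogue]
    return dialogue_level_attend(turn_vectors)

# Toy dialogue: two turns with 2-dimensional frame features (made-up values).
dialogue = [
    [[0.1, 0.9], [0.2, 0.8]],
    [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]],
]
query = [1.0, 0.0]
context = dual_level_affective_context(dialogue, query)
```

In this sketch `context` holds one vector per turn. The first turn can only attend to itself, so its context vector equals its pooled turn vector; later turns mix in earlier ones, which is one simple way to represent affective dynamics across a dialogue.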

Taxonomy

Topics: Topic Modeling · Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining