E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

TL;DR
E-chat is a novel spoken dialogue system that uses emotion embeddings and large language models to understand and respond to emotional speech, improving emotional comprehension in human-machine interactions.
Contribution
The paper introduces E-chat, a new emotion-sensitive dialogue system that integrates speech emotion embeddings with LLMs and presents the E-chat200 dataset for emotion-aware dialogue research.
Findings
E-chat outperforms baseline models in emotional comprehension tasks.
The system effectively responds to different emotional contexts.
The E-chat200 dataset facilitates emotion-sensitive dialogue research.
Abstract
This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed from speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. In various evaluation metrics, E-chat…
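The abstract describes an emotion embedding, extracted by a speech encoder, being combined with an LLM. One way this kind of conditioning might be wired up is sketched below. Everything in it is an assumption made for illustration and is not taken from the paper: the mean pooling, the tanh layer, the projection, the choice to prepend the emotion vector as an extra input position, the random weights, all dimensions, and the hypothetical names `matvec`, `build_llm_inputs`, and `random_matrix`.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def build_llm_inputs(speech_frames, text_embeddings, w_emotion, w_proj):
    """Prepend one projected emotion vector to a sequence of text embeddings.

    Hypothetical helper: the pooling, weights, and prepend strategy are
    illustrative assumptions, not the method described in the paper.
    """
    n_frames = len(speech_frames)
    # Assumption: mean-pool frame-level features into one utterance-level vector.
    pooled = [sum(col) / n_frames for col in zip(*speech_frames)]
    # Assumption: a single tanh layer stands in for the emotion embedding extractor.
    emotion_embedding = [math.tanh(x) for x in matvec(pooled, w_emotion)]
    # Assumption: a linear projection maps the embedding to the LLM input width.
    emotion_token = matvec(emotion_embedding, w_proj)
    # The emotion vector becomes one extra position in front of the text sequence.
    return [emotion_token] + list(text_embeddings)


def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or encoder outputs."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
speech_frames = random_matrix(50, 16, rng)    # 50 frames of 16-dim features
text_embeddings = random_matrix(7, 32, rng)   # 7 text positions, 32-dim LLM width
w_emotion = random_matrix(16, 8, rng)         # features -> 8-dim emotion embedding
w_proj = random_matrix(8, 32, rng)            # emotion embedding -> LLM width

llm_inputs = build_llm_inputs(speech_frames, text_embeddings, w_emotion, w_proj)
print(len(llm_inputs), len(llm_inputs[0]))
```

In this sketch the text embeddings pass through unchanged and the sequence grows by one position, so the downstream model would see the emotion information as an additional input alongside the transcript.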
Taxonomy
Topics: Speech and dialogue systems · Emotion and Mood Recognition · Speech Recognition and Synthesis
