BLSP-Emo: Towards Empathetic Large Speech-Language Models
Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang

TL;DR
BLSP-Emo introduces an end-to-end speech-language model that understands semantics and emotions, generating empathetic responses by leveraging existing ASR and SER datasets through a two-stage pretraining process.
Contribution
The paper presents a novel two-stage pretraining approach that builds an empathetic speech-language model from existing ASR and SER datasets.
Findings
BLSP-Emo effectively comprehends speech and emotions.
The model generates empathetic responses in conversations.
It outperforms baseline models in instruction-following tasks.
Abstract
The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The…
Taxonomy
Topics: Topic Modeling
