Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark

Han Zhang; Zixiang Meng; Meng Luo; Hong Han; Lizi Liao; Erik Cambria; Hao Fei

arXiv:2502.04976·cs.MM·February 10, 2025

Open Access

TL;DR

This paper introduces AvaMERG, a large-scale multimodal benchmark dataset for empathetic response generation using text, speech, and avatar videos, and presents Empatheia, an end-to-end model that outperforms baselines in this task.

Contribution

The paper pioneers the multimodal empathetic response generation (ERG) task by creating a new benchmark dataset and developing a novel end-to-end model with multimodal reasoning and empathetic tuning strategies.

Findings

01 Empatheia outperforms baseline methods on AvaMERG.

02 Multimodal inputs improve empathetic response quality.

03 The benchmark enables future research in multimodal ERG.

Abstract

Empathetic Response Generation (ERG) is a key task in affective computing that aims to produce emotionally nuanced and compassionate responses to users' queries. However, existing ERG research is predominantly confined to the text modality alone, which limits its effectiveness because human emotions are inherently conveyed through multiple modalities. To address this, we introduce an avatar-based Multimodal ERG (MERG) task, entailing rich text, speech, and facial vision information. We first present a large-scale, high-quality benchmark dataset, AvaMERG, which extends traditional text ERG by incorporating authentic human speech audio and dynamic talking-face avatar videos, encompassing a diverse range of avatar profiles and broadly covering various topics of real-world scenarios. Further, we deliberately tailor a system, named Empatheia, for MERG.…

Code & Models

Datasets

SIChoi/AvaMERG_frame
dataset · 4 dl

Taxonomy

Topics: AI in Service Interactions · Speech and dialogue systems · Social Robot Interaction and HRI