Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark
Han Zhang, Zixiang Meng, Meng Luo, Hong Han, Lizi Liao, Erik Cambria, Hao Fei

TL;DR
This paper introduces AvaMERG, a large-scale multimodal benchmark dataset for empathetic response generation using text, speech, and avatar videos, and presents Empatheia, an end-to-end model that outperforms baselines in this task.
Contribution
It pioneers the multimodal ERG task by creating a new benchmark dataset and developing a novel end-to-end model with multimodal reasoning and empathetic tuning strategies.
Findings
Empatheia outperforms baseline methods on AvaMERG.
Multimodal inputs improve empathetic response quality.
The benchmark enables future research in multimodal ERG.
Abstract
Empathetic Response Generation (ERG) is one of the key tasks in affective computing, which aims to produce emotionally nuanced and compassionate responses to users' queries. However, existing ERG research is predominantly confined to the text modality alone, limiting its effectiveness since human emotions are inherently conveyed through multiple modalities. To combat this, we introduce an avatar-based Multimodal ERG (MERG) task, entailing rich text, speech, and facial vision information. We first present a large-scale high-quality benchmark dataset, AvaMERG, which extends traditional text ERG by incorporating authentic human speech audio and dynamic talking-face avatar videos, encompassing a diverse range of avatar profiles and broadly covering various topics of real-world scenarios. Further, we deliberately tailor a system, named Empatheia, for MERG.…
Taxonomy
Topics: AI in Service Interactions · Speech and dialogue systems · Social Robot Interaction and HRI
