EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models

Li Zhou; Lutong Yu; You Lyu; Yihang Lin; Zefeng Zhao; Junyi Ao; Yuhao Zhang; Benyou Wang; Haizhou Li

arXiv:2510.22758·cs.CL·March 6, 2026

TL;DR

EchoMind introduces a comprehensive, multi-level benchmark for evaluating empathetic speech language models, focusing on their ability to perceive vocal cues and generate emotionally aligned responses in dialogue.

Contribution

It is the first benchmark to simulate empathetic dialogue through interconnected tasks involving speech understanding, vocal cue perception, reasoning, and response generation, with a detailed empathy framework.

Findings

01

State-of-the-art models struggle with vocal cues and empathy.

02

Models show weaknesses in instruction following and in robustness to speech variability.

03

The results highlight the need for models that integrate linguistic and vocal cues to respond with empathy.

Abstract

Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non-lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled…
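The task design described in the abstract can be illustrated with a minimal sketch. It assumes one neutral script rendered with several vocal styles, and one task per level that carries the earlier levels as context. The names `Sample`, `build_tasks`, the example script, and the cue labels are hypothetical and are not taken from the EchoMind release.

```python
from dataclasses import dataclass

# The four task levels named in the abstract, in order.
LEVELS = (
    "spoken-content understanding",
    "vocal-cue perception",
    "integrated reasoning",
    "response generation",
)


@dataclass(frozen=True)
class Sample:
    """Hypothetical container: one neutral script spoken with one vocal cue."""
    script: str      # semantically neutral text, shared across renditions
    vocal_cue: str   # illustrative label for how the script is spoken


def build_tasks(sample: Sample) -> list[dict]:
    """Return one task per level; each task lists the earlier levels as context."""
    tasks = []
    for i, level in enumerate(LEVELS):
        tasks.append({
            "level": level,
            "script": sample.script,
            "vocal_cue": sample.vocal_cue,
            "context": list(LEVELS[:i]),  # levels that precede this one
        })
    return tasks


# Same words, different delivery: only the vocal cue changes between renditions.
script = "I just got back from the meeting."
renditions = [Sample(script, cue) for cue in ("cheerful tone", "sighing")]
all_tasks = [build_tasks(s) for s in renditions]
```

Because the script text is held constant, any difference in an appropriate response between the two renditions must come from the vocal cue, which is the property the shared neutral scripts are meant to isolate.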

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Novel benchmark for evaluating the empathetic capabilities of Speech LLMs. High-quality taxonomy covering 39 attributes across speaker, paralinguistic, and environmental dimensions, providing comprehensive coverage of the non-lexical vocal cues essential for human-like conversation.
- Rigorous evaluation in multiple settings, using both automatic and human evaluation at both text and audio levels. Moreover, it introduces specialized empathy metrics (EmoAlign, Vocal Empathy Score) and validates it on ma

Weaknesses

- Minor: The majority of the data is TTS-generated (646/1,137 scripts), with only 2 professional voice actors for the human recordings, potentially missing natural variation and introducing artifacts that don't reflect real-world conditions.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This is the first benchmark explicitly targeting speech-based social intelligence and empathetic understanding. The effort to formalize empathy-related evaluation dimensions in SLMs is both timely and valuable for the community.

Weaknesses

- While the paper motivates the work by criticising existing benchmarks for evaluating capabilities in isolation, the proposed benchmark does not actually integrate these skills in a meaningful way. The three “levels” (understanding, reasoning, and conversation) are still evaluated independently, using separate metrics. What is presented feels like more dimensions, not genuine integration. A stronger demonstration would involve an Arena-style extrinsic evaluation assessing the real downstream im

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The focus on empathetic capabilities in SLMs addresses a critical gap in current benchmarks, moving beyond pure linguistic understanding to emotional intelligence.
- The benchmark uses controlled vocal-style variations of semantically neutral scripts across 39 vocal attributes spanning speaker information, paralinguistic cues, and environmental sounds.

Weaknesses

- The study introduces multiple fine-grained tasks, but results are only reported at the coarse levels (understanding, reasoning, conversation). Showing results for each sub-task would reveal which empathy aspects, like tone recognition or emotional adaptation, remain most challenging.
- While the paper lists 39 vocal attributes, their individual contributions to empathy are not analyzed. Exploring which features (e.g., pitch, tempo, timbre) drive performance would make the work more insightful an

Code & Models

Datasets

hlt-cuhksz/EchoMind
dataset · 989 dl
