When Large Language Models are Reliable for Judging Empathic Communication

Aakriti Kumar; Nalin Poungpeth; Diyi Yang; Erina Farrell; Bruce Lambert; Matthew Groh

arXiv:2506.10150·cs.CL·October 6, 2025


TL;DR

This study evaluates how reliably large language models judge empathic communication, comparing their annotations with those of experts and crowdworkers across four evaluative frameworks applied to 200 real-world support conversations.

Contribution

The paper shows that LLMs can approach expert-level reliability in assessing empathic communication and exceed the reliability of crowdworkers, and it highlights that the choice of benchmark, in particular expert agreement, matters when interpreting these results.

Findings

01 LLMs approach expert agreement levels in empathy judgment

02 LLMs outperform crowdworkers in reliability

03 Expert agreement varies across framework sub-components with their clarity, complexity, and subjectivity

Abstract

Large language models (LLMs) excel at generating empathic responses in text-based conversations. But how reliably do they judge the nuances of empathic communication? We investigate this question by comparing how experts, crowdworkers, and LLMs annotate empathic communication across four evaluative frameworks, drawn from psychology, natural language processing, and communications, applied to 200 real-world conversations where one speaker shares a personal problem and the other offers support. Drawing on 3,150 expert annotations, 2,844 crowd annotations, and 3,150 LLM annotations, we assess inter-rater reliability between these three annotator groups. We find that expert agreement is high but varies across the frameworks' sub-components depending on their clarity, complexity, and subjectivity. We show that expert agreement offers a more informative benchmark for contextualizing LLM…
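As a rough illustration of what "inter-rater reliability" between two annotators means, the sketch below computes Cohen's kappa, a chance-corrected agreement score. This listing does not say which reliability statistic the paper uses, so kappa is chosen here only as a familiar example; the function name `cohen_kappa` and the ratings are hypothetical and not taken from the paper's data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on the same items.

    Assumption: unweighted Cohen's kappa over categorical labels. The
    paper may use a different statistic; this is only an illustration.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected agreement if each rater labelled independently, following
    # their own label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings on a 0-2 scale for eight conversations.
expert_ratings = [0, 1, 2, 2, 1, 0, 1, 2]
llm_ratings = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = cohen_kappa(expert_ratings, llm_ratings)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance. Comparing an expert-versus-LLM score with an expert-versus-expert score on the same items is one way to read the abstract's idea of using expert agreement as the benchmark.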


Code & Models

Repositories

aakriti1kumar/replication-data-and-code-when-LLMs-reliable-empathic-communication (Official)


Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Artificial Intelligence in Healthcare and Education