Scoring with Large Language Models: A Study on Measuring Empathy of Responses in Dialogues
Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu

TL;DR
This study investigates how large language models score empathy in dialogues, developing a framework that uses explicit features and classifiers to understand and approximate LLM empathy scoring performance.
Contribution
It introduces a novel framework for analyzing LLM empathy scoring using explicit features and classifiers, enhancing interpretability and understanding of LLM evaluation methods.
Findings
Embedding-based features achieve performance close to generic LLMs.
Combining MITI Code and explicit subfactors improves scoring accuracy.
Feature selection identifies key features for empathy scoring.
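The feature-selection finding can be illustrated with a minimal sketch. It assumes a simple correlation filter: each explicit feature is ranked by the absolute Pearson correlation between its values and the empathy scores. The summary does not say which selection method the paper uses, so this is only one plausible stand-in. The feature names (`reflection`, `question`, `advice`) and the toy data are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def rank_features(rows, labels, names):
    """Rank features by |correlation| with the empathy score, strongest first."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs(pearson(column, labels))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical binary features per response (assumed, not from the paper).
names = ["reflection", "question", "advice"]
rows = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
# Hypothetical empathy scores on a 0-2 scale.
labels = [2, 2, 0, 0, 1, 1]

ranking = rank_features(rows, labels, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A filter like this is cheap and easy to interpret, but it scores each feature in isolation. Wrapper or model-based selection would also capture interactions between features.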
Abstract
In recent years, Large Language Models (LLMs) have become increasingly powerful in their ability to complete complex tasks. One such task in which LLMs are often employed is scoring, i.e., assigning a numerical value from a certain scale to a subject. In this paper, we strive to understand how LLMs score, specifically in the context of empathy scoring. We develop a novel and comprehensive framework for investigating how effective LLMs are at measuring and scoring empathy of responses in dialogues, and what methods can be employed to deepen our understanding of LLM scoring. Our strategy is to approximate the performance of state-of-the-art and fine-tuned LLMs with explicit and explainable features. We train classifiers using various features of dialogues including embeddings, the Motivational Interviewing Treatment Integrity (MITI) Code, a set of explicit subfactors of empathy as…
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling
Methods: Sparse Evolutionary Training · Feature Selection
