ESC-Judge: A Framework for Comparing Emotional Support Conversational Agents
Navid Madani, Rohini Srihari

TL;DR
ESC-Judge is a scalable, automated framework grounded in counseling theory. It evaluates emotional-support chatbots by simulating realistic help-seeker scenarios and comparing model responses head to head, with judgments that agree with human annotators in over 80% of cases.
Contribution
The paper introduces what the authors describe as the first end-to-end, theory-grounded, automated evaluation framework for emotional-support LLMs, enabling scalable and interpretable comparisons between models.
Findings
ESC-Judge matched human annotators' decisions with over 80% accuracy
Automated evaluation reduces cost and time compared to human annotation
Provides transparent, theory-based assessment of emotional support quality
Abstract
Large language models (LLMs) increasingly power mental-health chatbots, yet the field still lacks a scalable, theory-grounded way to decide which model is most effective to deploy. We present ESC-Judge, the first end-to-end evaluation framework that (i) grounds head-to-head comparisons of emotional-support LLMs in Clara Hill's established Exploration-Insight-Action counseling model, providing a structured and interpretable view of performance, and (ii) fully automates the evaluation pipeline at scale. ESC-Judge operates in three stages: first, it synthesizes realistic help-seeker roles by sampling empirically salient attributes such as stressors, personality, and life history; second, it has two candidate support agents conduct separate sessions with the same role, isolating model-specific strategies; and third, it asks a specialized judge LLM to express pairwise preferences across…
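The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attribute pools, the `Role` fields, the function names, and the toy agents and judge are all hypothetical stand-ins, and the judge here returns a single overall preference rather than the fuller set of judgments the paper uses.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical attribute pools; the paper's actual attribute sets are not listed here.
STRESSORS = ["job loss", "breakup", "exam pressure"]
PERSONALITIES = ["reserved", "talkative", "anxious"]


@dataclass(frozen=True)
class Role:
    """A synthesized help-seeker role (simplified to two attributes)."""
    stressor: str
    personality: str


# An agent maps a role to a session transcript (list of agent utterances).
Agent = Callable[[Role], List[str]]
# A judge compares two transcripts for the same role: "A", "B", or "tie".
Judge = Callable[[Role, List[str], List[str]], str]


def synthesize_role(rng: random.Random) -> Role:
    """Stage 1: sample a help-seeker role from the attribute pools."""
    return Role(rng.choice(STRESSORS), rng.choice(PERSONALITIES))


def compare(agent_a: Agent, agent_b: Agent, judge: Judge,
            n_roles: int, seed: int = 0) -> Dict[str, int]:
    """Run all three stages for n_roles roles and tally the judge's preferences."""
    rng = random.Random(seed)
    tally = {"A": 0, "B": 0, "tie": 0}
    for _ in range(n_roles):
        role = synthesize_role(rng)
        # Stage 2: each agent holds a separate session with the same role.
        transcript_a = agent_a(role)
        transcript_b = agent_b(role)
        # Stage 3: the judge expresses a pairwise preference.
        tally[judge(role, transcript_a, transcript_b)] += 1
    return tally


# Toy stand-ins for the LLM-backed agents and judge (assumptions for illustration).
def terse_agent(role: Role) -> List[str]:
    return ["That sounds hard."]


def exploring_agent(role: Role) -> List[str]:
    return [f"Tell me more about the {role.stressor}.",
            "What feels hardest right now?"]


def length_judge(role: Role, a: List[str], b: List[str]) -> str:
    # Placeholder heuristic only: prefers the longer session.
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    print(compare(exploring_agent, terse_agent, length_judge, n_roles=5))
```

In a real setup, the agents and the judge would be backed by LLM calls, and holding the role fixed across both sessions is what lets differences in the transcripts be attributed to the models rather than to the scenario.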
Taxonomy
Topics: Digital Mental Health Interventions · Mental Health via Writing · Artificial Intelligence in Healthcare and Education
