Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation
Abeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi, Sheri Grach, Lindsay Bertrand, Lames Danok, Frank Rudzicz, Jimmy Huang, Elham Dolatabadi

TL;DR
This study develops a human-centered evaluation method to assess the therapeutic quality of responses generated by large language models in mental health conversations, highlighting strengths and weaknesses in cognitive and affective support.
Contribution
It introduces a multidimensional human evaluation framework for mental health LLM responses, emphasizing the importance of therapeutic sensitivity and clinical relevance.
Findings
LLMs provide safe, coherent, and clinically appropriate information.
Open-source models show greater variability and emotional flatness.
There is a persistent cognitive-affective gap in LLM responses.
Abstract
The escalating global mental health crisis, marked by persistent treatment gaps, limited availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain open challenges. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed-source and open-source models. More specifically, these responses were evaluated by two psychiatric-trained experts, who independently rated each on a 5-point Likert scale…
Taxonomy
Topics: Mental Health via Writing · Digital Mental Health Interventions · Machine Learning in Healthcare
