Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation

Abeer Badawi; Md Tahmid Rahman Laskar; Elahe Rahimi; Sheri Grach; Lindsay Bertrand; Lames Danok; Frank Rudzicz; Jimmy Huang; Elham Dolatabadi

arXiv:2601.18630·cs.AI·January 27, 2026

Open Access

TL;DR

This study develops a human-centered evaluation method to assess the therapeutic quality of responses generated by large language models in mental health conversations, highlighting strengths and weaknesses in cognitive and affective support.

Contribution

It introduces a multidimensional human evaluation framework for mental health LLM responses, emphasizing the importance of therapeutic sensitivity and clinical relevance.

Findings

01

LLMs provide safe, coherent, and clinically appropriate information.

02

Open-source models show greater variability and emotional flatness.

03

There is a persistent cognitive-affective gap in LLM responses.

Abstract

The escalating global mental health crisis, marked by persistent treatment gaps, limited availability of care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain open challenges. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed-source and open-source models. More specifically, these responses were evaluated by two psychiatrically trained experts, who independently rated each on a 5-point Likert scale…
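As a rough illustration of the rating setup the abstract describes (two experts independently scoring each response on a 5-point Likert scale), the sketch below aggregates paired ratings per model and attribute. It is not the authors' analysis code: the model names, the attribute names (`empathy`, `safety`), the sample scores, and the `summarize` helper are all hypothetical placeholders, and exact agreement is used only as a simple stand-in for whatever agreement statistic the paper reports.

```python
from statistics import mean

# Hypothetical ratings: (model, attribute, rater_a, rater_b), each on a 1-5 Likert scale.
# Model names, attribute names, and scores are illustrative, not taken from the paper.
ratings = [
    ("model_x", "empathy", 4, 3),
    ("model_x", "safety", 5, 5),
    ("model_y", "empathy", 2, 3),
    ("model_y", "safety", 4, 4),
]


def summarize(rows):
    """Return, per (model, attribute), the mean score and the share of exact rater agreement."""
    grouped = {}
    for model, attribute, a, b in rows:
        # Reject scores outside the 5-point scale before aggregating.
        for score in (a, b):
            if not 1 <= score <= 5:
                raise ValueError(f"Likert score out of range: {score}")
        grouped.setdefault((model, attribute), []).append((a, b))

    summary = {}
    for key, pairs in grouped.items():
        scores = [s for pair in pairs for s in pair]
        agreement = sum(a == b for a, b in pairs) / len(pairs)
        summary[key] = {"mean": mean(scores), "exact_agreement": agreement}
    return summary


summary = summarize(ratings)
for (model, attribute), stats in sorted(summary.items()):
    print(model, attribute, stats["mean"], stats["exact_agreement"])
```

Averaging the two raters' scores treats Likert values as interval data, which is a common simplification; a median or a chance-corrected agreement measure may be more appropriate depending on the analysis.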

Taxonomy

Topics: Mental Health via Writing · Digital Mental Health Interventions · Machine Learning in Healthcare