Responsible Evaluation of AI for Mental Health

Hiba Arnaout; Anmol Goel; H. Andrew Schwartz; Steffen T. Eberhardt; Dana Atzil-Slonim; Gavin Doherty; Brian Schwartz; Wolfgang Lutz; Tim Althoff; Munmun De Choudhury; Hamidreza Jamalabadi; Raj Sanjay Shah; Flor Miriam Plaza-del-Arco; Dirk Hovy; Maria Liakata; Iryna Gurevych

arXiv:2602.00065 · cs.CY · April 29, 2026

TL;DR

This paper argues for a comprehensive, interdisciplinary framework for evaluating AI tools in mental health, one that emphasizes clinical validity, social context, and equity. The argument rests on an analysis of 135 recent *CL publications and on case studies.

Contribution

It introduces a structured evaluation framework and taxonomy for AI mental health tools, addressing current gaps in clinical relevance, safety, and user experience.

Findings

01 Current evaluations rely on generic metrics that miss clinical and social nuances.

02 Many studies lack participation from mental health professionals.

03 There are notable gaps in safety and equity considerations.

Abstract

Although artificial intelligence (AI) shows growing promise for mental health care, current approaches to evaluating AI tools in this domain remain fragmented and poorly aligned with clinical practice, social context, and first-hand user experience. This paper argues for a rethinking of responsible evaluation -- what is measured, by whom, and for what purpose -- by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity, providing a structured basis for evaluation. Through an analysis of 135 recent *CL publications, we identify recurring limitations: over-reliance on generic metrics that do not capture clinical validity, therapeutic appropriateness, or user experience; limited participation from mental health professionals; and insufficient attention to safety and equity. To address these gaps, we propose a taxonomy of AI…
