TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang; Souhad Chbeir; Arpandeep Khatua; Sheng Wang; Sijun Tan; Kenan Ye; Lily Bailey; Merryn Daniel; Ryan Louie; Sanmi Koyejo; Ehsan Adeli

arXiv:2603.18008·cs.CL·March 20, 2026

TL;DR

This paper presents THERAPYGYM, a comprehensive framework for evaluating and improving therapy chatbots by measuring clinical fidelity and safety, using automated scoring and expert-validated benchmarks.

Contribution

It introduces a novel evaluation framework and training methods for therapy chatbots that focus on clinical fidelity and safety, addressing gaps in existing benchmarks.

Findings

01. Models trained with THERAPYGYM show significant improvements in clinical fidelity.

02. The framework enables scalable development of safer, more faithful therapy chatbots.

03. Expert ratings validate the effectiveness of the evaluation and training methods.

Abstract

Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians.…
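
The auditing idea described in the abstract, comparing an LLM judge against licensed clinicians, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `audit_judge`, the tolerance of one scale point, and all scores below are invented for illustration. The only assumption taken from the standard CTRS is that each item is rated on a 0-6 scale.

```python
from statistics import mean

def audit_judge(pairs_by_dimension, tolerance=1):
    """Compare judge scores with clinician scores, one CTRS dimension at a time.

    pairs_by_dimension maps a dimension name to a list of
    (clinician_score, judge_score) pairs on the 0-6 CTRS item scale.
    """
    report = {}
    for dim, pairs in pairs_by_dimension.items():
        diffs = [abs(h - j) for h, j in pairs]
        report[dim] = {
            # Average size of the disagreement, in scale points.
            "mae": mean(diffs),
            # Share of dialogues where the judge is within `tolerance` points.
            "within_tolerance": sum(d <= tolerance for d in diffs) / len(diffs),
            # Positive bias means the judge scores higher than the clinician.
            "bias": mean(j - h for h, j in pairs),
        }
    return report

# Hypothetical ratings (made up for this example, not from THERAPYJUDGEBENCH).
example = {
    "agenda_setting": [(4, 4), (2, 3), (5, 5), (3, 5)],
    "guided_discovery": [(1, 4), (3, 0), (2, 5), (4, 4)],
}

report = audit_judge(example)
for dim, stats in report.items():
    print(dim, stats)
```

In this toy data both dimensions have the same average bias, but the second has a much larger mean absolute error, which is why a per-dimension error measure is reported alongside bias: over- and under-scoring can cancel out in the average.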

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The work contributes a new evaluation method for LLM therapy chatbots by focusing on clinically relevant measures of performance; namely, adherence and competence in delivering treatment, and the avoidance of harmful behaviors. Their framework uses a clinically validated scale (CTRS) to evaluate and fine-tune LLMs as therapists.

Weaknesses

TherapyGym was not evaluated on a realistic, comprehensive dataset of therapist-patient dialogues (it uses just 116 dialogues, each consisting of only 5 turns per agent). General-purpose and mental health-specific LLMs are often biased and generate false information. Previous work cited in the introduction specifically cautions against using simulated clients and therapists to avoid this and instead suggests evaluating LLMs on real-world therapist-client interactions, making the reliance on sim

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Framing evaluation around clinical constructs (fidelity and safety) is exactly the direction the community needs; CTRS-linked behavioral coding is far more meaningful than generic fluency or preference metrics.
- A dual-purpose environment (evaluation plus RL training harness) is attractive and pragmatic, closing the loop from measurement to improvement.
- LLM-judge validation against licensed clinicians via THERAPYJUDGEBENCH is a good step toward quantifying judge reliability.
- Emphasis on mu

Weaknesses

- CTRS automation is underspecified. The paper lacks a clear rubric, item mappings (CTS-R/CTRS to the “9 CBT skills”), scoring granularity (turn vs session), aggregation rules, and reliability analyses, making it hard to audit whether CTRS is faithfully operationalized.
- Human rating setup is thin and ambiguous. Only four raters are mentioned; rating units (turn/skill/session) are unclear; inter-rater reliability is missing; claims that gains “transfer to expert ratings” lack quantitative evide

Reviewer 03 · Rating 8 · Confidence 3

Strengths

The methodology is sound and rigorously designed. The use of the established CTRS scale provides a validated foundation for assessing therapeutic fidelity. The experiments are comprehensive, covering inter-rater reliability among human experts, alignment between LLM judges and humans, and the end-to-end efficacy of the RL fine-tuning pipeline. The decision to exclude CTRS dimensions with low human-human agreement (e.g., Guided Discovery) before reward modeling is a prudent choice that strengthen

Weaknesses

As noted, the simulated patients, while based on cognitive models, may behave more ideally and articulately than real-world patients who are often hesitant, ambivalent, or inarticulate. This could limit the framework's ability to evaluate crucial therapist skills like dealing with resistance, navigating ambiguity, and building rapport with a reluctant client. Further, the four safety categories are a good start but may not encompass all emergent risks in therapeutic AI. For instance, the risk of

Taxonomy

Topics: Digital Mental Health Interventions · Mental Health via Writing · Artificial Intelligence in Healthcare and Education