TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli

TL;DR
This paper presents THERAPYGYM, a comprehensive framework for evaluating and improving therapy chatbots by measuring clinical fidelity and safety, using automated scoring and expert-validated benchmarks.
Contribution
It introduces a novel evaluation framework and training methods for therapy chatbots that focus on clinical fidelity and safety, addressing gaps in existing benchmarks.
Findings
Models trained with THERAPYGYM show significant improvements in clinical fidelity.
The framework enables scalable development of safer, more faithful therapy chatbots.
Expert ratings validate the effectiveness of the evaluation and training methods.
Abstract
Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians.…
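The abstract describes auditing LLM-based judges against licensed clinicians. One way such an audit could be sketched, assuming each dialogue has a single judge score and a single expert score on the same scale (the standard CTRS rates items from 0 to 6), is shown below. The function names, the choice of metrics (mean absolute error, signed bias, Spearman rank correlation), and the sample scores are illustrative assumptions, not the paper's pipeline or data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def audit_judge(judge_scores, expert_scores):
    """Compare judge scores with expert scores for the same dialogues."""
    if len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must have the same length")
    diffs = [j - e for j, e in zip(judge_scores, expert_scores)]
    return {
        "mae": mean(abs(d) for d in diffs),   # average size of disagreement
        "bias": mean(diffs),                  # > 0 means the judge is lenient
        "spearman": pearson(average_ranks(judge_scores),
                            average_ranks(expert_scores)),
    }


# Hypothetical scores for five dialogues (not taken from THERAPYJUDGEBENCH).
report = audit_judge([4, 5, 3, 6, 2], [3, 5, 2, 6, 1])
print(report)
```

A judge can rank dialogues in the same order as clinicians while still scoring systematically higher, which is why the sketch reports bias separately from rank correlation.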
Peer Reviews
Decision · Submitted to ICLR 2026
The work contributes a new evaluation method for LLM therapy chatbots by focusing on clinically relevant measures of performance; namely, adherence and competence in delivering treatment, and the avoidance of harmful behaviors. Their framework uses a clinically validated scale (CTRS) to evaluate and fine-tune LLMs as therapists.
TherapyGym was not evaluated on a realistic, comprehensive dataset of therapist-patient dialogues (it uses just 116 dialogues, each consisting of only 5 turns per agent). General-purpose and mental health-specific LLMs are often biased and generate false information. Previous work cited in the introduction specifically cautions against using simulated clients and therapists to avoid this and instead suggests evaluating LLMs on real-world therapist-client interactions, making the reliance on sim
- Framing evaluation around clinical constructs (fidelity and safety) is exactly the direction the community needs; CTRS-linked behavioral coding is far more meaningful than generic fluency or preference metrics.
- A dual-purpose environment (evaluation plus RL training harness) is attractive and pragmatic, closing the loop from measurement to improvement.
- LLM-judge validation against licensed clinicians via THERAPYJUDGEBENCH is a good step toward quantifying judge reliability.
- Emphasis on mu
- CTRS automation is underspecified. The paper lacks a clear rubric, item mappings (CTS-R/CTRS to the “9 CBT skills”), scoring granularity (turn vs session), aggregation rules, and reliability analyses, making it hard to audit whether CTRS is faithfully operationalized.
- Human rating setup is thin and ambiguous. Only four raters are mentioned; rating units (turn/skill/session) are unclear; inter-rater reliability is missing; claims that gains “transfer to expert ratings” lack quantitative evide
The methodology is sound and rigorously designed. The use of the established CTRS scale provides a validated foundation for assessing therapeutic fidelity. The experiments are comprehensive, covering inter-rater reliability among human experts, alignment between LLM judges and humans, and the end-to-end efficacy of the RL fine-tuning pipeline. The decision to exclude CTRS dimensions with low human-human agreement (e.g., Guided Discovery) before reward modeling is a prudent choice that strengthen
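The dimension-filtering step this review mentions (dropping CTRS dimensions with low human-human agreement before reward modeling) could look roughly like the sketch below. The agreement measure (share of paired ratings within one point), the 0.7 threshold, the dimension names, and the reward normalization are all assumptions chosen for illustration; the paper's actual criterion is not stated in this excerpt.

```python
def within_one_agreement(pairs):
    """Fraction of (rater_a, rater_b) pairs that differ by at most 1 point."""
    return sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)


def reliable_dimensions(paired_ratings, threshold=0.7):
    """Keep dimensions whose human-human agreement meets the threshold."""
    return [
        dim for dim, pairs in paired_ratings.items()
        if within_one_agreement(pairs) >= threshold
    ]


def session_reward(judge_scores, kept, max_score=6):
    """Average the judge's scores over kept dimensions, scaled to [0, 1]."""
    if not kept:
        raise ValueError("no reliable dimensions left to score")
    return sum(judge_scores[d] for d in kept) / (max_score * len(kept))


# Hypothetical paired expert ratings per dimension (illustrative values only).
paired = {
    "agenda_setting": [(4, 4), (3, 4), (5, 5), (2, 3)],
    "guided_discovery": [(1, 5), (2, 2), (6, 3), (0, 4)],
}
kept = reliable_dimensions(paired)
reward = session_reward({"agenda_setting": 3, "guided_discovery": 6}, kept)
print(kept, reward)
```

Filtering before reward modeling means the policy is never optimized toward a score that the human experts themselves could not agree on.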
As noted, the simulated patients, while based on cognitive models, may behave more ideally and articulately than real-world patients who are often hesitant, ambivalent, or inarticulate. This could limit the framework's ability to evaluate crucial therapist skills like dealing with resistance, navigating ambiguity, and building rapport with a reluctant client. Further, the four safety categories are a good start but may not encompass all emergent risks in therapeutic AI. For instance, the risk of
Taxonomy
Topics: Digital Mental Health Interventions · Mental Health via Writing · Artificial Intelligence in Healthcare and Education
