AI Safety Training Can be Clinically Harmful

Suhas BN; Andrew M. Sherrill; Rosa I. Arriaga; Chris W. Wiese; Saeed Abdullah

arXiv:2604.23445·cs.CL·April 28, 2026

TL;DR

This study finds that large language models used for mental health support can cause psychological harm when their safety alignment interferes with therapy, which highlights the need for rigorous multi-axis evaluation before deployment.

Contribution

The paper systematically evaluates LLMs in therapeutic scenarios, exposes failures in safety and protocol fidelity, and proposes a comprehensive evaluation framework for AI mental health systems.

Findings

01

Models show high surface acknowledgment but poor therapeutic appropriateness at high severity.

02

RLHF safety alignment can disrupt therapeutic mechanisms and safety in mental health applications.

03

The paper proposes a five-axis evaluation framework for safe deployment of AI in mental health.

Abstract

Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning…
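The abstract reports per-axis scores from a three-judge LLM panel, broken down by severity. As a rough illustration of how such numbers could be tabulated, the sketch below averages the judges for each response and then averages responses within each (model, severity, axis) group. The abstract does not state the paper's actual aggregation rule, so the mean-of-means choice, the record layout, the integer severity levels, and the function name `panel_means` are all assumptions, and the sample scores are made up for the example rather than taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def panel_means(records):
    """Average judge scores per (model, severity, axis).

    Each record is a dict with hypothetical keys: "model", "severity",
    "axis", and "judge_scores" (one float per judge). This layout is an
    assumption made for illustration, not the paper's data format.
    """
    buckets = defaultdict(list)
    for record in records:
        # Collapse the judge panel into a single score for this response.
        response_score = mean(record["judge_scores"])
        key = (record["model"], record["severity"], record["axis"])
        buckets[key].append(response_score)
    # Average over all responses that share a model, severity, and axis.
    return {key: mean(scores) for key, scores in buckets.items()}


# Made-up scores for one hypothetical model; not results from the paper.
records = [
    {"model": "model_a", "severity": 1, "axis": "surface_acknowledgment",
     "judge_scores": [1.0, 1.0, 1.0]},
    {"model": "model_a", "severity": 5, "axis": "therapeutic_appropriateness",
     "judge_scores": [0.0, 0.5, 0.25]},
    {"model": "model_a", "severity": 5, "axis": "therapeutic_appropriateness",
     "judge_scores": [0.5, 0.25, 0.0]},
]

result = panel_means(records)
for (model, severity, axis), score in sorted(result.items()):
    print(f"{model} severity={severity} {axis}: {score:.2f}")
```

Keeping severity in the grouping key matters for the pattern the abstract describes: a single average per axis would hide a drop in therapeutic appropriateness that appears only at the highest severity.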
