AI Safety Training Can be Clinically Harmful
Suhas BN, Andrew M. Sherrill, Rosa I. Arriaga, Chris W. Wiese, Saeed Abdullah

TL;DR
This study shows that large language models currently used for mental health support can cause psychological harm when safety alignment interferes with therapeutic goals, and it argues for rigorous multi-axis evaluation before deployment.
Contribution
The paper systematically evaluates LLMs in therapeutic scenarios, exposes failures in safety and protocol fidelity, and proposes a comprehensive evaluation framework for AI mental health systems.
Findings
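Models score near-perfectly on surface acknowledgment (~0.91-1.00), but therapeutic appropriateness collapses to 0.22-0.33 at the highest severity for three of four models.
RLHF safety alignment can disrupt the therapeutic mechanisms that mental health applications rely on, undermining patient safety rather than protecting it.
The authors propose a five-axis evaluation framework for the safe deployment of AI in mental health.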
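To make the first finding concrete, here is a minimal sketch of how per-axis scores from a judge panel might be compared across severity levels. It rests on assumptions not stated in this summary: that the three judges' scores are combined by a simple mean, that each response carries an integer severity level, and that the axis names are as written below. The function names and the records are hypothetical and made up for illustration; they are not data from the paper.

```python
from collections import defaultdict
from statistics import mean


def panel_score(judge_scores):
    """Combine one response's judge scores by a simple mean (assumed aggregation)."""
    return mean(judge_scores)


def mean_by_severity(records, axis):
    """Average the panel score for one axis within each severity level."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["severity"]].append(panel_score(rec["scores"][axis]))
    # Sort by severity so the trend across levels is easy to read.
    return {sev: mean(vals) for sev, vals in sorted(buckets.items())}


# Hypothetical records for illustration only; not results from the paper.
records = [
    {"severity": 1, "scores": {"surface_acknowledgment": [1.0, 1.0, 1.0],
                               "therapeutic_appropriateness": [0.9, 0.8, 1.0]}},
    {"severity": 3, "scores": {"surface_acknowledgment": [1.0, 0.9, 1.0],
                               "therapeutic_appropriateness": [0.3, 0.2, 0.4]}},
]

for axis in ("surface_acknowledgment", "therapeutic_appropriateness"):
    print(axis, mean_by_severity(records, axis))
```

Reporting each axis separately per severity level, rather than one pooled score, is what lets a high surface-acknowledgment score sit next to a low therapeutic-appropriateness score without one masking the other.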
Abstract
Large language models are being deployed as mental health support agents at scale, yet only 16% of LLM-based chatbot interventions have undergone rigorous clinical efficacy testing, and simulations reveal psychological deterioration in over one-third of cases. We evaluate four generative models on 250 Prolonged Exposure (PE) therapy scenarios and 146 CBT cognitive restructuring exercises (plus 29 severity-escalated variants), scored by a three-judge LLM panel. All models scored near-perfectly on surface acknowledgment (~0.91-1.00) while therapeutic appropriateness collapsed to 0.22-0.33 at the highest severity for three of four models, with protocol fidelity reaching zero for two. Under CBT severity escalation, one model's task completeness dropped from 92% to 71% while the frontier model's safety-interference score fell from 0.99 to 0.61. We identify a systematic, modality-spanning…
