# Automated Safety Plan Scoring in Outpatient Mental Health Settings Using Large Language Models: Exploratory Study

**Authors:** Hayoung K Donnelly, Gregory K Brown, Kelly L Green, Ugurcan Vurgun, Sy Hwang, Emily Schriver, Michael Steinberg, Megan E Reilly, Haitisha Mehta, Christa Labouliere, Maria A Oquendo, David Mandell, Danielle L Mowery

PMC · DOI: 10.2196/79010 · JMIR Mental Health · 2026-01-08

## TL;DR

This study explores using AI to automatically score suicide prevention safety plans, aiming to help mental health clinicians improve their practice.

## Contribution

The study introduces the Safety Plan Fidelity Rater, an automated tool using large language models to assess safety plan quality.

## Key findings

- LLaMA 3 and o3-mini outperformed GPT-4 in scoring safety plan components.
- A different scoring system is recommended for each safety plan step, chosen by weighted F1-score.
- LLMs have the potential to give clinicians timely, accurate feedback on safety plan quality, which could improve safety planning in community practice.

## Abstract

The safety planning intervention (SPI) is a suicide prevention intervention that results in a written plan to help patients reduce suicide risk. High-quality safety plans—that is, those that are the most complete, personalized, and specific—are more effective in reducing suicide risk. Measuring SPI quality is labor-intensive, which means that clinicians rarely get specific, actionable feedback on their use of the SPI.

This study aimed to develop the Safety Plan Fidelity Rater, an automated tool that uses 3 large language models (LLMs), GPT-4, LLaMA 3, and o3-mini, to assess the quality of written safety plans.

Using 266 deidentified safety plans from outpatient mental health settings in New York, the LLMs analyzed 4 key steps: warning signs, internal coping strategies, making the environment safe, and reasons for living. We compared the predictive performance of the 3 LLMs while optimizing scoring systems, prompts, and parameters.

Findings showed that LLaMA 3 and o3-mini outperformed GPT-4, with different step-specific scoring systems recommended based on weighted F1-scores.
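For readers unfamiliar with the metric, the sketch below shows one common way a weighted F1-score is computed when LLM-assigned scores are compared with human ratings: the F1 of each score category is averaged, weighted by how often that category appears in the human ratings. The function name `weighted_f1`, the 0 to 2 score scale, and the example ratings are assumptions made for illustration; they are not taken from the paper, whose actual scoring systems and evaluation code are described only in the full text.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores.

    y_true: reference labels (e.g., human fidelity ratings)
    y_pred: predicted labels (e.g., LLM-assigned ratings)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    support = Counter(y_true)  # how often each label occurs in the reference
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of the data
    return score


# Hypothetical ratings for one safety plan step (illustrative values only).
human_scores = [0, 1, 2, 2, 1, 0]
llm_scores = [0, 1, 2, 1, 1, 0]
print(round(weighted_f1(human_scores, llm_scores), 4))
```

Weighting by support keeps a rare score category from dominating the average, which matters when ratings are unevenly distributed across categories. Under this sketch, a per-step comparison would repeat the calculation for each model and scoring system and keep the highest-scoring combination.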

These findings highlight LLMs’ potential to provide clinicians with timely and accurate feedback on safety plan quality, which could greatly improve implementation of the SPI in community practice.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12782459/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12782459/full.md

## References

16 references — full list in the complete paper: https://tomesphere.com/paper/PMC12782459/full.md

---
Source: https://tomesphere.com/paper/PMC12782459