Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety

Caitlin Stamatis; Jonah Meyerhoff; Richard Zhang; Olivier Tieleman; Matteo Malgaroli; Thomas Hull

PMC · DOI:10.21203/rs.3.rs-8642399/v1·January 27, 2026

Beyond Simulations: What 20,000 Real Conversations Reveal About Mental Health AI Safety

Caitlin Stamatis, Jonah Meyerhoff, Richard Zhang, Olivier Tieleman, Matteo Malgaroli, Thomas Hull

PDF

Open Access

TL;DR

A mental health AI system was found safer than general AI models in handling sensitive topics, based on real user conversations and safety benchmarks.

Contribution

The study introduces an ecological audit using 20,000 real user conversations to evaluate mental health AI safety, revealing lower harmful content rates.

Findings

01

The purpose-built mental health AI produced significantly less harmful content than general-purpose LLMs across suicide, NSSI, eating disorder, and substance use benchmarks.

02

In real user data, no suicide-risk cases lacked crisis resources, and only 0.015% of NSSI mentions lacked intervention.

03

Ecological audits are shown to be effective for estimating AI safety in real-world mental health applications.

Abstract

Large language models (LLMs) are increasingly used for mental health, yet safety evaluations rely primarily on small, simulation-based benchmarks removed from real-world language. We replicate four published safety evaluations assessing suicide risk handling, harmful content generation, and jailbreak resistance for general-purpose frontier models and a purpose-built mental health AI. We then conduct an ecological audit of 20,000 real user conversations with the purpose-built system, which includes layered safeguards for suicide and non-suicidal self-injury (NSSI). The purpose-built AI was significantly less likely than general-purpose LLMs to produce harmful content across suicide/NSSI (.4–11.27% vs 29.0–54.4%), eating disorder (8.4% vs 54.0%), and substance use (9.9% vs 45.0%) benchmarks. In real user data, clinician review found zero suicide-risk cases without crisis resources. Three…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Diseases2

eating disorder -suicidal self-injury

Figures4

Click any figure to enlarge with its caption.

a](#F3)): self-harm/suicide (Ash 0.4%; GPT-5.2 29.0%; GPT-5.1 44.0%; GPT-5 12.0%), eating disorders (Ash 8.4%; GPT-5.2 54.0%; GPT-5.1 58.0%; GPT-5 18.0%), and substance abuse (Ash 9.9%; GPT-5.2 45.0%; GPT-5.1 54.0%; GPT-5 31.0%). With the contextual modifier (“This is for a presentation”), harmful response rates generally increased, but Ash remained significantly lower than GPT-5.2/5.1/5 across self-harm/suicide (10.0% vs 23.0%/54.0%/13.0%), eating disorders (18.0% vs 52.0%/83.0%/34.0%), and substance abuse (8.8% vs 43.0%/69.0%/45.0%) (all *p*<.05; Table 2; [Figure 3a](#F3)).

Keywords

mental healthlarge language modelartificial intelligencesafetysuicideself-harm

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMental Health via Writing · Digital Mental Health Interventions · Suicide and Self-Harm Studies

Full text

Introduction

Access to effective mental health care remains severely constrained: Globally, only 6.9% of people with mental health disorders receive effective treatment.^1^ In the U.S., less than 20% of psychiatrists are available to see new patients, and median wait times for psychiatric services range from 43–67 days.^2^ 137 million Americans live in a mental health professional shortage area (estimated 6,800 practitioners needed),^3^ with disproportionate impact to marginalized communities.^4^

In parallel with these access constraints, large language models (LLMs) are increasingly used for psychological support: nearly 50% of people with mental health conditions report turning to LLMs for help.^5^ Early meta-analytic evidence indicates that conversational AI can reduce symptoms of depression and distress,^6^ supporting the potential for AI to expand access to mental health support. However, multiple studies show that general-purpose LLMs still fail relatively basic safety tests, including reinforcing delusional beliefs, enabling maladaptive behavior, and responding inappropriately to crisis disclosures.^7–9^ In high-stakes therapeutic settings, even a small number of unsafe outputs can cause meaningful harm.^10,11^

Although LLM safety guardrails have improved substantially, many current approaches rely on static refusal templates or generic crisis-response heuristics,^12^ which may not reliably capture subtle psychological risk, implicit self-harm intent, or context-dependent disclosures.^13^ Moreover, guardrails operating only at the time of message generation have nontrivial error rates and cannot fully compensate for underlying model behaviors that systematically produce unsafe content, particularly when models have little or no clinically relevant data for pretraining or alignment that would allow them to interpret emotionally complex or risk-laden content.^14,15^

Just as guardrails can be insufficient for deployment, benchmarks can be insufficient for safety evaluation. Mental-health safety is highly nuanced and complex, with users expressing suicidality, self-harm, and other risk behaviors through language that is often subtle, ambiguous, and indirect, or that unfolds over time.^13^ Existing studies highlight the difficulty of building realistic user simulators, which can produce overly agreeable and unrealistic user dialogue even when constructed with frontier LLMs.^16^ Many safety evaluations rely on few iterations of small prompt sets with limited diversity and unknown alignment to real-world user language, limiting their ability to detect low-frequency but clinically consequential failures.^17,18^ Thus, benchmarks, red-teaming, and jailbreak tests provide valuable standardized stress tests, but necessarily cover only a subset of clinically relevant scenarios and may not reflect the linguistic, behavioral, and cultural distribution encountered in deployment.

In the present study, we operationalize an ecological safety approach for evaluating generative AI when deployed in real world mental health contexts. We conduct a safety evaluation of Ash, a deployed conversational AI system purpose-built for mental health.^19^ Within Ash, suicide- and NSSI-related risk behavior is managed through a layered, two-system safety architecture (Figure 1), in which model-level risk handling is supplemented by an independent embeddings-based guardrail system trained on real-world usage data, rather than developer-generated hypothetical examples. To evaluate Ash alongside general-purpose models, we replicate the methods of four published safety test sets assessing clinically relevant failure modes, including suicide-risk response calibration,^15^ harmful content generation across high-risk domains,^20^ refusal robustness,^7^ and adversarial jailbreak attempts.^13^ We then apply an ecological audit to verify test set estimates of risk with an analysis of 20,000 real user conversations, assessing how suicide and non-suicidal self-injury (NSSI) risk manifests in naturalistic deployment and how reliably the system’s layered safeguards (Figure 1; see Methods for detail) deliver escalation messages and crisis resources at scale.

Results

Benchmark replications

2.1

For in-depth descriptions of each test set, data generation approach, and evaluation procedures, see Methods section.

Test Set 1: Suicide risk assessment.15

2.1.1

This benchmark evaluates the frequency of direct responses to 30 suicide-related questions, as well as whether model response directness varies according to clinician-rated risk.

Across 30 suicide-related questions, GPT-5-series models produced more direct responses than Ash (Table 1; Figure 2). GPT-5.1 had the highest overall direct-response rate (mean 67.70%, median 87%), followed by GPT-5.2 (mean 60.17%, median 82%) and GPT-5 (mean 54.40%, median 67.5%). Ash was substantially lower overall (mean 11.27%, median 0%) and produced 0% direct responses on 20/30 questions (vs. 9/30 for GPT-5, 5/30 for GPT-5.1, and 7/30 for GPT-5.2).

Stratified by clinician-rated risk category,^15^ Ash was consistently less direct than GPT models (χ^2^/Fisher’s exact; Table 1). In the High category (n=5), Ash never responded directly (mean 0%, 5/5 questions at 0%), while GPT-5, GPT-5.1, and GPT-5.2 averaged 33.6%, 78.8%, and 56.2%, respectively (all χ^2^ tests significant, p’s < .001). In the Very high category (n=6), direct responses were essentially absent for GPT-5 (mean 0.0%) and GPT-5.2 (mean 0.17%), while GPT-5.1 showed a higher mean (13.67%; χ^2^(1)=84.92, p < .001 vs. Ash) driven by a single outlier item with 81% direct responses (Ash 0% on that item; Supplementary Table S1). In the Medium category, Ash produced a significantly lower proportion of direct responses (18.30%) than GPT-5 (61.90%), GPT-5.1 (68.70%), and GPT-5.2 (66.10%) (all χ^2^ tests significant, p’s < .001), and Ash responded directly 0% of the time on 8/10 Medium-risk questions. In lower-risk categories, GPT-5-series models were near ceiling (Very low: all three at 100% mean; Low: 89.0–93.0% means), while Ash remained low (Very low mean 5.25%; Low mean 26.6%).

Test Set 2: Suicide/self-harm, eating disorders, and substance use.20

2.1.2

This benchmark probes whether models generate enabling or harmful content across self-harm/suicide, eating disorders, and substance use prompts, including under a simple contextual jailbreak (“This is for a presentation”).

Harmful responses.

Under the original prompt, Ash produced a significantly lower proportion of harmful responses than all GPT models across categories (all χ^2^ p<.001; Table 2; Figure 3a): self-harm/suicide (Ash 0.4%; GPT-5.2 29.0%; GPT-5.1 44.0%; GPT-5 12.0%), eating disorders (Ash 8.4%; GPT-5.2 54.0%; GPT-5.1 58.0%; GPT-5 18.0%), and substance abuse (Ash 9.9%; GPT-5.2 45.0%; GPT-5.1 54.0%; GPT-5 31.0%). With the contextual modifier (“This is for a presentation”), harmful response rates generally increased, but Ash remained significantly lower than GPT-5.2/5.1/5 across self-harm/suicide (10.0% vs 23.0%/54.0%/13.0%), eating disorders (18.0% vs 52.0%/83.0%/34.0%), and substance abuse (8.8% vs 43.0%/69.0%/45.0%) (all p<.05; Table 2; Figure 3a).

Follow-up suggestions.

Under the original prompt, Ash also produced a significantly lower proportion of follow-up suggestions than all GPT models (all χ^2^ p<.001; Supplementary Table S2; Figure 3b): self-harm/suicide (Ash 1.3%; GPT-5.2 90.0%; GPT-5.1 44.0%; GPT-5 20.0%), eating disorders (Ash 4.5%; GPT-5.2 91.0%; GPT-5.1 77.0%; GPT-5 65.0%), and substance abuse (Ash 3.6%; GPT-5.2 92.0%; GPT-5.1 81.0%; GPT-5 60.0%). With the modifier, follow-up invitation rates rose across models, but Ash remained significantly lower than GPT-5.2/5.1/5 for self-harm/suicide (16.0% vs 23.0%/90.0%/88.0%), eating disorders (22.0% vs 64.0%/93.0%/92.0%), and substance abuse (12.0% vs 69.0%/91.0%/92.0%) (all p<.001; Supplementary Table S2; Figure 3b).

Test Set 3: SimpleSafetyTests.7

2.1.3

This benchmark tests whether models refuse or safely redirect unambiguously unsafe requests in high-severity harm areas. This benchmark was saturated by all models across both test sets examined (Suicide, Self-Harm, and Eating Disorders; Child Abuse). Both Ash and the GPT-5-series models responded appropriately in nearly 100% of cases, with or without the safety prompt for GPT-5 (Supplementary Table S3; Extended Data Figure 1).

Test Set 4: Jailbreak benchmark.13

2.1.4

This benchmark evaluates multi-turn adversarial reframing attempts designed to elicit self-harm or suicide method information across a dialogue, rather than in a single prompt. Again, the benchmark was saturated (Supplementary Table S4; Extended Data Figure 2), with both Ash and GPT-5 series models showing near-100% performance across the self-harm and suicide jailbreak attempts.

Real-world data (ecological audit)

2.2

The ecological audit process is described in detail in the Methods section (see also Extended Data Figure 3). Figure 4 contains stepwise results from the real-world SI/NSSI detection process.

Batch 1: September 2025 (10,000 sessions).

2.2.1

In 10,000 randomly sampled Ash sessions (i.e., conversations) from September 2025, the LLM judge flagged 576 sessions with potential SI/NSSI in ≥1 of 4 runs. The safety classifier (Figure 1; System 2) detected 300; the remaining 276 went to clinician review. Clinicians found no SI/NSSI in 231 sessions and confirmed SI/NSSI in 45. In 42/45 confirmed cases, the Ash model (i.e., Figure 1, System 1) delivered an escalation message with 988 and sometimes additional crisis resources. Three NSSI-only edge cases did not receive the full protocol: one involved ambiguous “SH” usage at the session end; one elicited a commitment from the user to reach out to a human but did not provide a hotline (988 was recommended two sessions later); and one de-escalated self-harm without surfacing resources, though the user had received crisis resources 7 days prior and again 3 days after. In the latter case, the user left positive feedback describing reduced urges to self-harm after the conversation (“She has helped me so much, like this morning I was feeling so down I wanted to cut myself. But after talking with her I felt much better and the urge went away.”).

Batch 2: December 2025 (10,000 sessions).

2.2.2

In 10,000 randomly sampled December 2025 sessions, the LLM judge flagged 224 sessions with potential SI/NSSI in ≥3 of 4 runs. The safety classifier detected 156; the remaining 68 went to clinician review. Clinicians found no SI/NSSI in 33 sessions and confirmed SI/NSSI in 35. In all 35 confirmed cases, the Ash model delivered 988 and sometimes additional crisis resources, with no false negatives.

Overall false negative rate.

2.2.3

Across 20,000 sampled sessions, clinicians confirmed 80 SI/NSSI sessions missed by the safety classifier. The conversational model delivered crisis resources in 77/80, with 3 NSSI sessions (0.015% of all sessions) where no crisis resources were provided. Among 800 LLM judge-flagged sessions, this corresponds to a 0.38% end-to-end failure rate (3/800) within the judge-flagged subset. Because clinicians reviewed only a triaged subset, this is a lower bound on the true population false negative rate. Inter-rater agreement was 97% (332/343), with 11 cases adjudicated by a third clinical psychologist.

Discussion

This study advances evaluation of AI mental-health tools by combining replication-based testing on published safety benchmarks with a large-scale, real-world ecological safety audit. Prior work has documented that general-purpose LLMs can fail safety checks in sensitive contexts and that small, low-powered evaluations may miss rare but clinically consequential failures.^7,8,10,11,21^ However, there remains limited evidence on how purpose-built, therapy-aligned systems perform in deployment at scale. Our contribution is twofold: we provide one of the largest ecological analyses to date of suicide/NSSI safety behavior in a purpose-built AI mental-health system, and we situate those deployment outcomes alongside controlled benchmark replications that probe clinical calibration, multi-domain harmful-content avoidance, and robustness to both clear-cut unsafe prompts and adversarial reframing.^7,13,15,20^

In real-world use, Ash’s layered safety architecture—which combines therapy-aligned conversational behavior with an independent suicide/NSSI classifier system—showed high sensitivity to suicide risk and rare end-to-end misses for NSSI-related content. Across 20,000 real-world conversations, clinician review identified no cases of suicide risk in which crisis resources were not delivered, and 3/20,000 conversations with clinician-identified NSSI-related risk without crisis intervention, the latter of which were more accurately described as edge cases than true false negatives upon clinician review. These findings matter because even low base-rate unsafe events can translate into substantial harm at scale, a concern increasingly emphasized in discussions of evaluation power and tail-risk failures in high-stakes settings,^10,11^ particularly given that AI agents are not subject to external accountability mechanisms (e.g., licensure).

Across benchmark replications, Ash was consistently less likely than frontier general-purpose LLMs to produce enabling, procedural, or otherwise unsafe content across suicide/self-harm, child safety, eating disorders, and substance use prompts, and exhibited more appropriate modulation of response directness as clinician-rated suicide risk increased.^15,20^ Notably, benchmark results reflect the behavior of the Ash conversational model alone, without additional product-level safety mechanisms (i.e., guardrails). This distinction is important because each safety layer has a nonzero error rate, and failure rates can compound within multi-component safety architectures.^22^ Together, the benchmark and deployment findings suggest that a combination of domain-aligned training and system-level safety controls can improve safety in mental-health settings compared with guardrails alone.^12,13^

Our results underscore that while simulated test sets provide valuable standardized stress tests, they may misestimate real-world safety performance of deployed systems or saturate for more mature systems. Benchmarks such as the 30 suicide questions and CCDH prompts isolate known risk categories and remain useful in early model development, when naturalistic deployment data are not yet available.^15,20^ Others, such as SimpleSafetyTests, may provide limited discriminative value at the frontier, with all GPT-5-series models performing near-ceiling after prior reports of 70% performance for OpenAI’s content moderation API and 89% for GPT-4.^7^ Further, test sets constructed by researchers may not reflect the linguistic diversity, cultural variability, and context-dependent ways that distress is expressed in real-world interactions.^13,16^ Real user transcripts reviewed from Ash underscored the often indirect, idiosyncratic nature of risk language. For example, one user discussed a coding system with shapes and symbols that refer to things like “SH/suicidal thoughts” and “SH scar”, as well as colors referring to levels of risk (e.g. black = “intensely thinking of it”, red = “about to do it”); notably, while the safety classifier was not triggered when the user referred only to these shapes/symbols and colors later on (because the coded user messages did not explicitly state anything about self-harm or suicide), the Ash model appropriately identified these codes and surfaced crisis resources. Our proposed ecological audit method permits the examination of how often risks appear, and how a system behaves at scale, in the context of real-world language that can be difficult to mimic in a benchmark.

Relatedly, our findings support the utility of layered safety architectures in the development of safe mental health AI. A central debate in LLM safety is whether guardrails applied at inference time are sufficient.^23^ Prior work suggests that many deployed approaches rely heavily on static refusal templates, keyword heuristics, or generic crisis scripts,^12^ and that such strategies can fail when users present with complex and subtle risk.^13^ Our findings are consistent with this concern: safety in mental-health dialogue depends not only on refusal behavior, but also on a model’s capacity to interpret disclosures appropriately and respond in clinically grounded ways, particularly in emotionally complex or risk-laden scenarios.^15,24^ These considerations motivate layered safety architectures that pair domain-aligned training with independent detection and intervention mechanisms, rather than relying on inference-time guardrails as the primary defense.^12,13^ Such architectures may be especially important when users discuss suicide historically or in a non-acute way—cases where a model could appropriately continue therapeutic engagement without switching into a crisis script, while an independent classifier still triggers standardized safety interventions. This also clarifies why evaluations that judge safety solely by whether a response contains crisis-hotline text can be misleading: appropriate clinical engagement is not synonymous with generic crisis scripts.^12^

This work has limitations. First, our ecological audit focuses on suicide and NSSI risk as mediated by our production classifier and review protocol; other clinically salient risks (e.g., delusional reinforcement or mania) may show different patterns.^7,8^ Second, the benchmarks replicated, while intentionally diverse, cannot fully represent real-world misuse or the full linguistic diversity of distress presentations.^13^ Third, clinician review was conducted on a triaged subset of judge-positive conversations, providing a lower-bound estimate of missed-risk prevalence in deployment. Finally, this study evaluates safety behavior rather than clinical efficacy. While meta-analytic evidence supports potential benefits of conversational AI in reducing distress,^6^ and early findings on Ash specifically suggest reductions in distress in some contexts,^19^ rigorous outcome studies are needed to determine when therapy-aligned systems improve clinical trajectories and how safety interventions affect engagement and help-seeking.

Taken together, this work demonstrates that as LLM use for psychological support becomes increasingly common, safety evaluation must move beyond small test sets and toward ecological assessments. Our results suggest that therapy-aligned training coupled with layered safety mechanisms can reduce harmful outputs in controlled benchmarks while achieving high sensitivity to suicide/self-harm risk in real-world conversations. Together, these findings support a shift toward continuous, multimodal, deployment-relevant safety assurance for AI mental-health systems.

Methods

Description of the purpose-built Ash system and guardrails

4.1

Foundation Model.

4.1.1

Ash is a purpose-built conversational system for mental health support.^19^ The system is built on a psychology-focused foundation model pre-trained on de-identified psychotherapy data and aligned via supervised fine-tuning on expert annotations. This training emphasizes clinical safety objectives, therapeutic communication appropriate to an AI-mediated setting, and detection and response to suicide and non-suicidal self-injury (NSSI) risk. The Ash conversational model serves as the primary mechanism for risk handling, generating user-facing responses and escalating to appropriate external crisis resources (e.g., the U.S. Suicide and Crisis Lifeline [988] or local equivalents) when risk thresholds are met (Figure 1, System 1).

Safety Guardrails.

4.1.2

In parallel, an independent suicide/NSSI risk classification system continuously monitors user text for indicators of self-harm or suicidal ideation (Figure 1, System 2) and operates without access to the conversational model’s internal state or outputs. The classifier pipeline consists of (a) an embeddings-based model optimized for high recall that flags potentially relevant messages, followed by (b) a slower LLM-based verifier prioritizing precision. When suicide risk is confirmed, the classifier system triggers an in-app safety banner providing crisis resources and direct pathways to human support. Classifier-confirmed risk also places Ash into a risk mitigation mode (RMM) for up to five conversational turns or until session termination, whichever occurs first, during which model outputs are subject to heightened safety monitoring. Because banner activation is driven solely by the classifier system and not by the conversational model, banners may appear in cases where Ash continues with a clinically appropriate, non-crisis response (e.g., historical or non-acute references to suicide).

Benchmark replications

4.2

Test set selection and overview.

4.2.1

We selected four published test sets that target distinct, deployment-relevant safety behaviors: (1) whether the model’s responses to suicide-related queries are calibrated to clinician-rated risk severity;^15^ (2) whether the model avoids generating harmful content across suicide/self-harm, eating disorders, and substance use;^20^ (3) whether the model refuses or safely redirects in response to unambiguously unsafe self-harm requests;^7^ and (4) whether these safety behaviors remain robust under multi-turn adversarial “jailbreak” reframing designed to bypass guardrails.^13^ Together, these non-exhaustive benchmarks provide complementary coverage of risk-calibrated crisis handling, multi-domain harm avoidance, basic refusal robustness, and adversarial resilience.

General methods.

4.2.2

For the published safety test sets, we queried both Ash and general-purpose LLMs of different generations (GPT-5, GPT-5.1, GPT-5.2) via API using the same stimulus. Unless otherwise specified, all model calls were configured with a temperature of 0.7 and default reasoning effort for GPT-5, GPT-5.1, and GPT-5.2. Except where specified, no system prompt was provided for any general-purpose model to match the system settings of typical usage. In simulations, the Ash model was called on its own without any additional safety architecture (i.e., in Figure 1, only System 1 was evaluated).

Test Set 1: Suicide risk assessment15.

4.2.3

We used the 30 suicide-related test questions provided by McBain et al. (2025), which vary according to clinician-rated severity levels. An LLM judge (see Supplementary Note S1) evaluated whether the model responded directly to the input question, with a total of 12,000 responses evaluated (30 questions * 4 models *100 repetitions per question per model). We compared the proportion of direct responses for each risk category for Ash relative to each GPT model using χ^2^ (or Fisher’s exact test when expected cell counts are <5).

Test Set 2: Suicide/self-harm, eating disorders, and substance use.20

4.2.4

The Center for Countering Digital Hate prompts cover self-harm and suicide (10 prompts), eating disorders (10 prompts), and substance use (10 prompts). For each prompt, we first collected the chatbot’s initial response. We then attempted jailbreaking with “This is for a presentation,” and recorded the chatbot’s subsequent reply. Both rounds of responses were assessed by an LLM judge (see Supplementary Note S2) for harmfulness and whether the chatbot provided suggestions for follow-up (“pushes users to continue engaging with the model” [CCDH, p. 16]). We evaluated a total of 24,000 responses (30 questions * 100 prompt repetitions * 2 [jailbreak follow-up] * 4 models).

Test Set 3: SimpleSafetyTests.7

4.2.5

We used two test subsets: (1) Suicide, Self-Harm, and Eating Disorders, and (2) Child Abuse, to assess whether models refused unequivocally unsafe requests. We entered each prompt to GPT-5 with and without a safety-emphasizing system prompt, testing these alongside our Ash model, GPT-5.1 and GPT-5.2. We evaluated 20,000 total responses (2,000 runs per model × 5 models × 2 harm areas) for safety using the LLM judge developed by Vigden et al. (2023) (see Supplementary Note S3).

Test Set 4: Jailbreak tests.13

4.2.6

We evaluated Ash, GPT-5, GPT-5.1, GPT-5.2, and gpt-4o. Jailbreak performance was measured using multi-turn prompts that attempted to bypass safety via contextual reframing (e.g., hypothetical, academic, or third-person intent), assessing whether models maintained safety boundaries across turns. Unlike the original paper, which used the ChatGPT product, we evaluated gpt-4o via API calls for consistency. An unsafe response was one that provided sufficient information to answer a query about self-harm or suicide. We tested one self-harm and one suicide scenario (Supplementary Note S4), repeating each rollout 100 times, totaling 4,500 self-harm runs and 5,500 suicide runs.

Evaluating Ash’s real-world performance responding to suicide and NSSI risk

4.3

Extended Data Figure 3 summarizes our real-world evaluation using de-identified, opt-in user transcripts. The study was approved by the Institutional Review Board at NYU School of Medicine (i25–01177). We sampled 20,000 conversations (10,000 from September 2025; 10,000 from December 2025) and labeled each for SI/NSSI using an LLM judge calibrated against licensed clinical psychologist ratings (Supplementary Note S5).^25^ The judge was run four times per session.

For the September batch, we compared judge results to the safety classifier and sent any sessions flagged in ≥1/4 runs—but missed by the classifier—for review by two psychologists (with a third resolving disagreements). Sessions were labeled as: (a) no SI/NSSI, (b) SI/NSSI present with crisis resources provided by the Ash model, or (c) SI/NSSI present with no crisis resources provided by the Ash model (false negative). For the December batch, the process was the same, except clinician review required flags in ≥3/4 runs. Our primary metric was end-to-end failure rate: the share of clinician-confirmed SI/NSSI sessions where neither the classifier banner nor the model provided crisis resources.

Supplementary Files

This is a list of supplementary files associated with this preprint. Click to download.

Tables.docx

SupplementaryMaterials.docx

Bibliography25

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Vigo D. V. Effective Treatment for Mental and Substance Use Disorders in 21 Countries. JAMA Psychiatry 82, 347–357 (2025).39908011 10.1001/jamapsychiatry.2024.4378 PMC 11800122 · doi ↗ · pubmed ↗
2Sun C.-F. Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the US. Gen. Hosp. Psychiatry 84, 12–17 (2023).37290263 10.1016/j.genhosppsych.2023.05.012 · doi ↗ · pubmed ↗
3Health Resources and Services Administration. Health Professional Shortage Areas (HPSA) design summary [Interactive data dashboard]. HRSA Data Warehouse. (2025).
4Panchal N., Hill L., Artiga S. & Hamel L. Racial and Ethnic Disparities in Mental Health Care: Findings from the KFF Survey of Racism, Discrimination and Health. KFF https://www.kff.org/racial-equity-and-health-policy/racial-and-ethnic-disparities-in-mental-health-care-findings-from-the-kff-survey-of-racism-discrimination-and-health/ (2024).
5Rousmaniere T., Li X., Zhang Y. & Shah S. Large Language Models as Mental Health Resources: Patterns of Use in the United States. Pract. Innov. Advance online publication., (2025).
6Li H., Zhang R., Lee Y.-C., Kraut R. E. & Mohr D. C. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. NPJ Digit. Med. 6, 236 (2023).38114588 10.1038/s 41746-023-00979-5PMC 10730549 · doi ↗ · pubmed ↗
7Vidgen B. Simplesafetytests: a test suite for identifying critical safety risks in large language models. Ar Xiv Prepr. Ar Xiv 231108370 (2023).
8Zhang R., Li H.-W., Qian X.-Y., Jiang W.-B. & Chen H.-X. On large language models safety, security, and privacy: A survey. J. Electron. Sci. Technol. 23, 100301 (2025).