Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines

Guifeng Deng; Shuyin Rao; Tianyu Lin; Anlu Dai; Pan Wang; Junyi Xie; Haidong Song; Ke Zhao; Dongwu Xu; Zhengdong Cheng; Tao Li; Haiteng Jiang

arXiv:2506.01329 · cs.CL · December 19, 2025

Open Access

TL;DR

This study evaluates large language models' effectiveness in crisis detection tasks using a real-world benchmark from psychological hotlines, highlighting their potential and limitations in clinical settings.

Contribution

Introduces PsyCrisisBench, a comprehensive benchmark for assessing LLMs in crisis detection, with evaluations across multiple models and tasks in a real-world context.

Findings

01 · LLMs achieved high F1 scores in suicidal ideation detection and risk assessment.

02 · Few-shot prompting and fine-tuning significantly improved LLM performance.

03 · A fine-tuned 1.5B-parameter model outperformed larger models on some tasks.

Abstract

Psychological support hotlines serve as critical lifelines for crisis intervention but face significant challenges from rising demand and limited resources. Large language models (LLMs) offer potential support in crisis assessments, yet their effectiveness in emotionally sensitive, real-world clinical settings remains underexplored. We introduce PsyCrisisBench, a comprehensive benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline, assessing four key tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. We evaluated 64 LLMs across 15 model families (including closed-source models such as GPT, Claude, and Gemini and open-source models such as Llama, Qwen, and DeepSeek) under zero-shot, few-shot, and fine-tuning paradigms. LLMs showed strong results in suicidal ideation detection (F1=0.880), suicide plan…
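For readers unfamiliar with the metric, the F1 figure reported in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch of that computation for a binary detection task such as suicidal ideation detection. It is not the authors' evaluation code: the function name `f1_score`, the label encoding (1 = ideation present), and the toy labels are all hypothetical, and the abstract does not state which averaging scheme (binary, macro, or weighted) the paper uses.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class (illustrative helper, not from the paper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive predictions: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (not benchmark data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # annotator labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

In a screening setting, a missed positive (false negative) is usually the costlier error, so recall is often worth inspecting alongside the single F1 number.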

Taxonomy

Topics: Public Relations and Crisis Communication · Computational and Text Analysis Methods · Mental Health via Writing