A Comprehensive Evaluation of Large Language Models on Mental Illnesses

Abdelrahman Hanafi; Mohammed Saad; Noureldin Zahran; Radwa J. Hanafy; Mohammed E. Fouda

arXiv:2409.15687 · cs.AI · November 25, 2025 · 2 cites

Open Access

TL;DR

This study systematically evaluates 33 large language models on mental health tasks using social media data, highlighting their strengths, limitations, and the impact of prompt engineering for mental health applications.

Contribution

It provides the largest-scale evaluation of modern LLMs for mental health, comparing zero-shot and few-shot performance across multiple tasks and models.

Findings

01. GPT-4 and Llama 3 achieved up to 85% accuracy in disorder detection.

02. Few-shot learning improved disorder severity evaluation, reducing MAE by 1.3 points.

03. Llama 3.1 405b achieved 91.2% accuracy in psychiatric knowledge assessment.
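
The findings above are reported as accuracy (disorder detection, knowledge assessment) and mean absolute error, MAE (severity evaluation). As a minimal sketch of how these two metrics are conventionally computed, the snippet below may help; the function names and the toy ratings are illustrative assumptions, not code or data from the paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between predicted and reference ratings."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical severity ratings, made up for illustration only.
reference = [0, 2, 3, 1]
zero_shot = [2, 2, 0, 3]   # predictions without in-prompt examples
few_shot = [1, 2, 2, 1]    # predictions with in-prompt examples

mae_zs = mean_absolute_error(reference, zero_shot)
mae_fs = mean_absolute_error(reference, few_shot)
print(f"zero-shot MAE: {mae_zs:.2f}, few-shot MAE: {mae_fs:.2f}")
```

A lower MAE means predicted severity ratings sit closer to the reference ratings, so a reduction in MAE under few-shot prompting is an improvement.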

Abstract

Large Language Models (LLMs) have shown promise in various domains, including healthcare, with significant potential to transform mental health applications by enabling scalable and accessible solutions. This study aims to provide a comprehensive evaluation of 33 LLMs, ranging from 2 billion to 405+ billion parameters, in performing key mental health tasks using social media data across six datasets. To our knowledge, this represents the largest-scale systematic evaluation of modern LLMs for mental health applications. Models such as GPT-4, Llama 3, Claude, Gemma, Gemini, and Phi-3 were assessed for their zero-shot (ZS) and few-shot (FS) capabilities across three tasks: binary disorder detection, disorder severity evaluation, and psychiatric knowledge assessment. Key findings revealed that models like GPT-4 and Llama 3 exhibited superior performance in binary disorder detection,…
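
The abstract contrasts zero-shot (ZS) and few-shot (FS) prompting. As a rough sketch of the difference for a binary detection task, the snippet below builds both kinds of prompt; the function `build_prompt`, the instruction wording, and the example posts are all hypothetical and are not the prompts used in the study.

```python
from typing import Sequence, Tuple


def build_prompt(
    post: str,
    disorder: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Return a detection prompt; labeled examples make it few-shot."""
    # Hypothetical instruction wording, not taken from the paper.
    parts = [
        f"Does the author of this post show signs of {disorder}? "
        "Answer yes or no."
    ]
    # Few-shot: prepend labeled demonstrations before the target post.
    for text, label in examples:
        parts.append(f"Post: {text}\nAnswer: {label}")
    # The target post is left unanswered for the model to complete.
    parts.append(f"Post: {post}\nAnswer:")
    return "\n\n".join(parts)


# Made-up posts for illustration only.
demos = [
    ("I have not enjoyed anything in weeks.", "yes"),
    ("Great hike with friends today.", "no"),
]
zero_shot_prompt = build_prompt("I can't sleep lately.", "depression")
few_shot_prompt = build_prompt("I can't sleep lately.", "depression", demos)
print(few_shot_prompt)
```

The only difference between the two settings in this sketch is whether labeled demonstrations precede the target post; the instruction and the target stay the same.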

Taxonomy

Topics: Mental Health via Writing · Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · LLaMA · Softmax · Layer Normalization · Dropout