MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Hoyun Song; Migyeong Kang; Jisu Shin; Jihyun Kim; Chanbi Park; Hangyeol Yoo; Jihyun An; Alice Oh; Jinyoung Han; KyungTae Lim

arXiv:2602.12871 · cs.CL · May 19, 2026

TL;DR

MentalBench is a new benchmark that evaluates whether large language models can make DSM-5-grounded psychiatric diagnoses, using a psychiatrist-validated knowledge graph and synthetic clinical cases. It highlights the limitations of current models in ambiguous cases.

Contribution

The paper introduces MentalBench, a DSM-grounded benchmark with a validated knowledge graph and synthetic cases, to assess LLMs' psychiatric diagnostic capabilities.

Findings

01. LLMs perform well on noise-free DSM knowledge queries.

02. Models struggle with confidence calibration in complex, overlapping symptom cases.

03. Current LLMs may not be reliable for psychiatric decision support.

Abstract

Large language models (LLMs) have attracted growing interest as supportive tools for psychiatric assessment and clinical decision support. However, existing mental health benchmarks largely rely on social media data or supportive dialogue settings, limiting their ability to assess whether models can apply formal diagnostic criteria and differential diagnostic rules. In this paper, we introduce MentalBench, a benchmark for evaluating whether LLMs can make DSM-grounded psychiatric diagnostic decisions under varying levels of clinical ambiguity. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as an expert-curated logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and…

Code & Models

Repositories

hoyuns/MentalBench (GitHub)

Taxonomy

Topics: Mental Health via Writing · Machine Learning in Healthcare · Digital Mental Health Interventions