CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li; Jifan Yao; John Bosco S. Bunyi; Adam C. Frank; Angel Hsing-Chi Hwang; Ruishan Liu

arXiv:2506.08584·cs.CL·May 15, 2026·2 cites

TL;DR

CounselBench is a comprehensive benchmark with expert evaluations and adversarial testing, designed to assess large language models' performance and safety in mental health question answering scenarios.

Contribution

The paper introduces a large-scale, clinically grounded framework with expert annotations and adversarial datasets to evaluate LLMs in mental health contexts.

Findings

01. LLM answers often receive high scores but still exhibit safety and relevance issues.

02. Expert evaluators identify recurring model failures and safety risks.

03. LLM judges tend to overrate responses and miss safety concerns.

Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations…
