TL;DR
CounselBench is a comprehensive benchmark with expert evaluations and adversarial testing, designed to assess large language models' performance and safety in mental health question answering scenarios.
Contribution
It introduces a large-scale, clinically grounded framework with expert annotations and adversarial datasets to evaluate LLMs in mental health contexts.
Findings
LLMs often receive high scores but still exhibit safety and relevance issues.
Expert evaluators identify recurring model failures and safety risks.
LLM judges tend to overrate responses and miss safety concerns.
Abstract
Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations…