Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth, Declan Grabb, Amy Franks, Scott Gershan, Kaitlyn N. Kunstman, Aaron Lulla, Monika Drummond Roots, Manu Sharma, Aryan Shrivastava, Nina Vasan, Colleen Waickman

TL;DR
This paper introduces a detailed, expert-annotated dataset for mental healthcare decision-making that captures real-world clinical complexities and biases, enabling more accurate evaluation of language models in mental health contexts.
Contribution
The creation of a clinician-annotated, demographic-variable dataset for mental health tasks, addressing gaps in existing benchmarks and enabling bias and performance analysis.
Findings
Models show varying accuracy across tasks.
Demographic information influences decision-making.
Free-form responses often deviate from expert annotations.
Abstract
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S.-centric dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all base questions with five…
Peer Reviews
Decision: ICLR 2026 Poster
1. The use of multiple clinicians on the same items would be valuable to show inter-rater agreement through statistics such as kappa or intra-class correlation. This would document label consistency, separate real clinical ambiguity from noise, and build confidence in the reliability of the resource. 2. The focus on everyday, ambiguous decision-making in psychiatry is a clea
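As a rough illustration of the agreement statistic the review asks for, Cohen's kappa for two raters can be sketched as below. The function name and the example labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (illustrative only)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical answer choices from two clinicians on four questions.
rater_a = ["A", "B", "A", "C"]
rater_b = ["A", "B", "B", "C"]
kappa = cohen_kappa(rater_a, rater_b)
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.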
1. A key weakness is that the free-form evaluation relies on a single semantic similarity signal. The study measures inconsistency as one minus BERTScore using a DeBERTa xlarge model fine-tuned on MNLI, then aggregates with bootstrap resampling. This narrow view can miss clinically reasonable phrasings and may not reflect calibration. Reporting multiple signals, such as ROUGE or BLEU, would give a more stable picture of free-form behavior. 2. The core dataset is small for the breadth of claims. The
1. Clinician-first design with verification and a thoughtful ambiguity pipeline that avoids LM-generated contamination during creation. 2. Methodological clarity: transparent HBT formulation (annotator effects), bootstrap CIs, and explicit prompt templates. 3. Writing is clear and is generally easy to follow.
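The reviews describe inconsistency as one minus a similarity score, aggregated with bootstrap resampling. A minimal sketch of a percentile bootstrap interval for the mean inconsistency follows. It assumes the similarity scores have already been computed; the scores, function name, and parameters are hypothetical, and the paper's exact procedure may differ.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-response similarity scores in [0, 1].
similarities = [0.91, 0.84, 0.77, 0.88, 0.69, 0.93, 0.81, 0.74]
inconsistency = [1.0 - s for s in similarities]
ci_low, ci_high = bootstrap_ci(inconsistency)
```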
1. Post-hoc zeroing of 'objectively wrong' answers (then renormalizing) encodes expert priors that bypass the HBT inference. The paper states most zeroed options had p ≤ 0.2 anyway, but no sensitivity analysis quantifies how many questions' top-ranked answer changes under alternative thresholds or no clamping. This risks conflating empirical disagreement with ex-post correctness. 2. While Section C justifies jurisdiction-specificity, the abstract/intro should state this limitation upfront.
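The sensitivity check requested in point 1 could be sketched as follows: zero out the options flagged as objectively wrong, renormalize, and test whether the top-ranked answer changes. This is an illustration under assumed inputs, not the authors' pipeline; the probabilities and function names are hypothetical.

```python
def zero_and_renormalize(probs, wrong):
    """Set flagged options to zero and rescale the rest to sum to one."""
    kept = {k: (0.0 if k in wrong else p) for k, p in probs.items()}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("all options were zeroed; nothing to renormalize")
    return {k: p / total for k, p in kept.items()}


def top_answer_changes(probs, wrong):
    """Return True if clamping changes which option is ranked first."""
    before = max(probs, key=probs.get)
    clamped = zero_and_renormalize(probs, wrong)
    after = max(clamped, key=clamped.get)
    return before != after


# Hypothetical inferred preference probabilities for one question.
question_probs = {"A": 0.5, "B": 0.3, "C": 0.2}
clamped = zero_and_renormalize(question_probs, {"C"})
changed_low = top_answer_changes(question_probs, {"C"})   # low-probability option zeroed
changed_top = top_answer_changes(question_probs, {"A"})   # top option zeroed
```

Counting how often `top_answer_changes` returns True across all questions, with and without clamping, would give the kind of summary the reviewer asks for.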
- A fully human-generated dataset is an exciting contribution. - The dataset’s multi-category framework shows which aspects of mental health current LLMs handle poorly, which is informative and actionable. Models perform less well in triage and documentation domains, which is an important insight. - Comparison of questions across demographic variables is clever, informative, and comprehensive. - The inclusion of preference annotations and multiple corr
- The description of the related work in the mental health space is confusing. It would be helpful to contrast which of these studies introduce benchmarks and to state an overall understanding of the state of LLM performance on mental health tasks. What are the generally accepted specific gaps or are they unknown? - Appendix plots of model performance across demographic variables do not suggest significance for the differences across them within groups. This is confusing given the claims in the
Taxonomy
Methods: Balanced Selection
