Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

Kiana Jafari; Paul Ulrich Nikolaus Rust; Duncan Eddy; Robbie Fraser; Nina Vasan; Darja Djordjevic; Akanksha Dadlani; Max Lamparth; Eugenia Kim; Mykel Kochenderfer

arXiv:2601.18061·cs.AI·May 12, 2026

TL;DR

This study reveals significant expert disagreement in evaluating mental health AI safety, especially on critical issues like suicide, highlighting limitations of human feedback for safety assessment.

Contribution

The paper demonstrates that expert judgments in mental health AI safety are highly inconsistent, challenging the assumption that aggregated human feedback provides a reliable ground truth.

Findings

01 Inter-rater reliability was consistently poor among psychiatrists evaluating responses.

02 Disagreement was highest on safety-critical items like suicide and self-harm.

03 Expert disagreement reflects coherent but incompatible clinical frameworks, not measurement error.

Abstract

Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC $0.087$–$0.295$), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's $\alpha = -0.203$), indicating structured disagreement worse than chance. Qualitative interviews…
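
To make the reliability figures above concrete, here is a minimal sketch of how an intraclass correlation coefficient can be computed from a table of ratings. The abstract does not say which ICC form the authors used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater) is an assumption, as are the function name `icc_2_1` and the toy ratings, which are illustrative and not data from the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated item and one column per
    rater. Assumes a complete table (no missing ratings) with some variance.
    """
    n = len(ratings)        # number of rated items (e.g. LLM responses)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Illustrative toy table: 6 items scored by 4 raters (not the paper's data).
toy = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(toy), 3))
```

In this toy table the raters rank the items in a broadly similar order but use very different parts of the scale, which is enough to pull an absolute-agreement ICC down into the range the abstract reports. Identical ratings across raters would give an ICC of 1.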
