Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores
Shevya Panda, Shinjini Bose, Ananya Joshi

TL;DR
This study systematically evaluates how prompt design and irrelevant inputs influence the reliability of LLM-generated psychiatric hospitalization risk scores, revealing sensitivity and instability in model outputs.
Contribution
It introduces a structured auditing approach to assess the impact of prompt variations and irrelevant data on LLMs in psychiatric risk assessment tasks.
Findings
Medically insignificant features increase risk score variability.
Prompt variations affect model stability in a model-dependent manner.
Including irrelevant data reduces predictive stability across models.
Abstract
Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to…
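The audit idea described above (vary the prompt, add medically insignificant inputs, and watch how the risk score moves) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the authors' implementation: `toy_score` is a deterministic stand-in for an LLM call, and the feature names `prior_admissions` and `favorite_color`, the prompts, and the metrics (population standard deviation across prompts, mean shift after adding irrelevant features) are hypothetical choices.

```python
import statistics
from typing import Callable, Dict, List

Profile = Dict[str, str]
ScoreFn = Callable[[str, Profile], float]


def audit(score_fn: ScoreFn, profiles: List[Profile], prompts: List[str],
          irrelevant: Profile) -> List[Dict[str, float]]:
    """For each profile, measure score spread across prompts and the mean
    shift caused by appending irrelevant features (hypothetical metrics)."""
    rows = []
    for profile in profiles:
        augmented = {**profile, **irrelevant}  # same patient plus irrelevant inputs
        base = [score_fn(p, profile) for p in prompts]
        aug = [score_fn(p, augmented) for p in prompts]
        rows.append({
            "prompt_spread": statistics.pstdev(base),
            "mean_shift": statistics.mean(aug) - statistics.mean(base),
        })
    return rows


def toy_score(prompt: str, profile: Profile) -> float:
    """Stand-in for an LLM risk score in [0, 1]; NOT a real model."""
    score = 0.5 if profile.get("prior_admissions") == "yes" else 0.2
    if "favorite_color" in profile:   # irrelevant feature leaks into the score
        score += 0.1
    if "step by step" in prompt:      # prompt wording changes the score
        score += 0.05
    return score


profiles = [{"prior_admissions": "yes"}, {"prior_admissions": "no"}]
prompts = ["Rate hospitalization risk.",
           "Think step by step, then rate hospitalization risk."]
rows = audit(toy_score, profiles, prompts, {"favorite_color": "blue"})
for row in rows:
    print(row)
```

In a real audit, `toy_score` would be replaced by a call to the model under test, and a reliable model would show a `mean_shift` near zero when only irrelevant features are added.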
