PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
Patrick Keough

TL;DR
This study introduces PsychBench, an epidemiological audit of large language model (LLM) mental health simulations. It finds that models produce clinically plausible individuals but misrepresent population-level distributions and encode demographic biases.
Contribution
The first comprehensive epidemiological evaluation of LLM patient simulations, highlighting population-level validity problems and biases in mental health modeling.
Findings
Models produce clinically plausible individuals but misrepresent population distributions.
Variance compression, ranging from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), reduces population diversity and eliminates the distributional tails of clinical reality.
Models overestimate depression severity and encode racialized and gendered biases.
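As a rough illustration of the variance compression finding, the sketch below computes compression as one minus the ratio of simulated to reference variance. This definition is an assumption made for illustration; the summary does not state how the paper computes the figure. The function name `variance_compression` and both score lists are hypothetical toy values, not data from the study.

```python
import statistics


def variance_compression(simulated, reference):
    """Fraction by which simulated variance falls short of reference variance.

    Assumed definition (not confirmed by the source): 1 - Var(sim) / Var(ref).
    A value of 0.0 means no compression; values near 1.0 mean almost all
    spread has been lost.
    """
    var_sim = statistics.pvariance(simulated)
    var_ref = statistics.pvariance(reference)
    if var_ref == 0:
        raise ValueError("reference variance must be non-zero")
    return 1.0 - var_sim / var_ref


# Toy symptom scores (hypothetical): the reference sample has long tails,
# while the simulated sample clusters around the middle of the range.
reference_scores = [0, 1, 2, 3, 5, 8, 12, 17, 21, 27]
simulated_scores = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]

compression = variance_compression(simulated_scores, reference_scores)
print(f"variance compression: {compression:.0%}")
```

The two toy samples have similar means, so a comparison of averages alone would not reveal the missing tails; the variance ratio does.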
Abstract
Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices…
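The coexistence of high test-retest correlation with frequent threshold crossing can look paradoxical, so a small sketch may help. It computes a Pearson correlation between two runs and the share of cases that fall on opposite sides of a cutoff. The cutoff of 10, the function names, and the paired scores are all hypothetical; the summary does not specify the instrument or thresholds the paper used.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def threshold_crossing_rate(run_a, run_b, cutoff):
    """Share of cases classified differently by the two runs at `cutoff`."""
    crossings = sum((a >= cutoff) != (b >= cutoff) for a, b in zip(run_a, run_b))
    return crossings / len(run_a)


# Toy paired scores for the same ten simulated profiles (hypothetical).
run_a = [2, 5, 8, 9, 10, 11, 14, 18, 22, 25]
run_b = [3, 4, 10, 11, 9, 8, 15, 17, 23, 24]
CUTOFF = 10  # hypothetical diagnostic threshold, assumed for illustration

r = pearson_r(run_a, run_b)
rate = threshold_crossing_rate(run_a, run_b, CUTOFF)
print(f"test-retest r = {r:.2f}, threshold crossings = {rate:.0%}")
```

Correlation rewards agreement in rank order across the whole range, so small shifts among cases near the cutoff barely move it, yet each such shift can flip a diagnostic classification.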
