"Mirror" Language AI Models of Depression are Criterion-Contaminated
Tong Li, Rasiq Hussain, Mehak Gupta, and Joshua R. Oltmanns

TL;DR
This study reveals that language AI models predicting depression from assessment responses are biased due to criterion contamination, and suggests using external language sources for more valid predictions.
Contribution
It compares 'Mirror' models relying on assessment responses with 'Non-Mirror' models using external language, highlighting contamination issues and proposing more valid approaches.
Findings
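Mirror models show near-perfect prediction, but the estimates are inflated by criterion contamination.
Non-Mirror models also reach prediction sizes considered large in psychology.
Mirror and Non-Mirror predictions correlate at similar sizes with questionnaire-based depression symptoms.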
Abstract
Recent studies show near-perfect language-based predictions of depression scores (R² = .70), but these "Mirror" models rely on language responses directly from depression assessments to predict depression assessment scores. These methods suffer from criterion contamination that inflates prediction estimates. We compare "Mirror" models to "Non-Mirror" models, which use other external language to predict depression scores. 110 participants completed both structured diagnostic (Mirror condition) and life history (Non-Mirror condition) interviews. LLMs were prompted to predict diagnostic depression scores. As expected, Mirror models were near-perfect. However, Non-Mirror models also displayed prediction sizes considered large in psychology. Further, both Mirror and Non-Mirror predictions correlated with other questionnaire-based depression symptoms at similar sizes, suggesting bias in Mirror…
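The comparison described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: R² is taken here to be the squared Pearson correlation between predicted and observed scores (the summary does not say how the authors computed it), the function names `pearson_r` and `r_squared` are hypothetical, and every number below is made-up toy data, not data or results from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared(predicted, observed):
    """Assumption: R^2 defined as the squared Pearson correlation."""
    return pearson_r(predicted, observed) ** 2


# Toy values, invented for illustration only (NOT from the paper).
interview_scores = [2, 5, 9, 4, 7, 1]    # diagnostic depression scores (criterion)
mirror_pred = [2, 5, 8, 4, 7, 2]         # predicted from the diagnostic interview itself
non_mirror_pred = [3, 4, 7, 6, 5, 2]     # predicted from a separate life history interview
questionnaire = [4, 8, 12, 9, 10, 3]     # independent self-report symptom measure

# Prediction of the criterion: the Mirror estimate is expected to look inflated.
print("Mirror R^2:    ", round(r_squared(mirror_pred, interview_scores), 3))
print("Non-Mirror R^2:", round(r_squared(non_mirror_pred, interview_scores), 3))

# External check: correlation of each prediction with the independent questionnaire.
print("Mirror r with questionnaire:    ", round(pearson_r(mirror_pred, questionnaire), 3))
print("Non-Mirror r with questionnaire:", round(pearson_r(non_mirror_pred, questionnaire), 3))
```

The first pair of numbers measures how well each model reproduces the criterion it was asked to predict. The second pair checks both models against a measure that neither one saw, which is where an advantage caused only by contamination would be expected to shrink.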
Taxonomy
Topics: Mental Health via Writing
