Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness
Erfan Loweimi, Sofia de la Fuente Garcia, Samira Loveymi, Hadi Daneshvar, Saturnino Luz

TL;DR
This study evaluates how reliably large language models screen for anxiety and depression from speech, measuring run-to-run consistency, robustness to speech recognition errors, and faithfulness of the evidence they cite, with implications for clinical use.
Contribution
It assesses three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on zero-shot HADS scoring from speech, using ground-truth transcripts and three Whisper ASR variants, and highlights robustness and interpretability issues.
Findings
Phi-4 and Gemma-2-9B show excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR.
Llama-3.1-8B is fragile to ASR errors, with ICC dropping from 0.82 to 0.36 at 10% WER.
Keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B.
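One plausible reading of keyword groundedness is the share of evidence keywords cited by a model that actually occur in the transcript it was given. The sketch below illustrates that reading only; the summary does not specify the paper's matching rule, so the normalization, the whole-word matching, and the function names `_normalize` and `keyword_groundedness` are all assumptions.

```python
import re


def _normalize(text):
    # Assumed normalization: lowercase, keep letters, digits and apostrophes,
    # and collapse everything else to single spaces.
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def keyword_groundedness(keywords, transcript):
    """Fraction of keywords found verbatim (after normalization) in the transcript.

    Hypothetical helper: the matching rule is an assumption, not the paper's.
    """
    padded = f" {_normalize(transcript)} "
    normed = [k for k in (_normalize(k) for k in keywords) if k]
    if not normed:
        return 0.0
    # Pad with spaces so that only whole-word (or whole-phrase) matches count.
    hits = sum(f" {k} " in padded for k in normed)
    return hits / len(normed)


transcript = "I can't sleep, and I feel worried all the time."
score = keyword_groundedness(["can't sleep", "worried", "hopeless"], transcript)
```

Here two of the three keywords appear in the transcript, so the score is 2/3. Under this reading, a keyword the model invented, or one that ASR errors removed from the transcript, would lower the score.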
Abstract
LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model…
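Intra-model consistency across the three runs can be summarized with an intraclass correlation coefficient. The abstract does not state which ICC form the authors use; the sketch below assumes the one-way random-effects, single-measurement form ICC(1,1), which treats runs as interchangeable. The function name `icc_one_way` and the toy scores are illustrative, not data from the paper.

```python
from statistics import mean


def icc_one_way(scores):
    """ICC(1,1) for a table with one row per participant and one column per run.

    Assumption: one-way random effects, single measurement. The paper may
    use a different ICC form.
    """
    n = len(scores)
    k = len(scores[0])
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 participants and 2 runs, with equal row lengths")
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    # Between-participant mean square.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Within-participant (run-to-run) mean square.
    msw = sum((v - m) ** 2 for row, m in zip(scores, row_means) for v in row) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (msb - msw) / denom


# Toy predicted scores for three participants over three runs (made up).
consistent = icc_one_way([[3, 3, 3], [10, 10, 10], [7, 7, 7]])
noisy = icc_one_way([[3, 9, 5], [10, 2, 8], [7, 6, 1]])
```

When every run returns the same score for a participant, the within-participant variance is zero and the ICC is 1; when run-to-run noise dominates the differences between participants, the ICC falls toward or below zero.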
