An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech
Qingkun Deng, Saturnino Luz, Sofia de la Fuente Garcia

TL;DR
This paper introduces an interpretable speech foundation model using long speech segments and a novel interpretation method to improve depression detection accuracy and clinical relevance.
Contribution
It presents a speech-level Audio Spectrogram Transformer and a new interpretation technique, enhancing depression detection and interpretability over segment-level models.
Findings
The proposed speech-level AST outperforms a segment-level AST in depression detection.
Long speech segments improve detection reliability.
Reduced loudness and F0 (fundamental frequency) are identified as depression-relevant signals, consistent with documented clinical findings.
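To make the two acoustic features named in the findings concrete, here is a minimal sketch of how loudness and F0 could be measured on a short speech frame. It is not the paper's interpretation method: RMS level in dB stands in for loudness, and a plain autocorrelation peak search stands in for F0 estimation. The function names, the 75-400 Hz search range, and the frame length are all assumptions chosen for illustration.

```python
import numpy as np


def rms_level_db(frame: np.ndarray) -> float:
    """RMS level of a frame in dB relative to an amplitude of 1.0.

    Used here as a simple stand-in for loudness (an assumption).
    """
    rms = np.sqrt(np.mean(frame ** 2))
    return 20.0 * np.log10(max(rms, 1e-12))  # floor avoids log(0) on silence


def estimate_f0_autocorr(frame: np.ndarray, sample_rate: int,
                         f0_min: float = 75.0, f0_max: float = 400.0) -> float:
    """Estimate F0 in Hz from the strongest autocorrelation peak.

    The search range f0_min..f0_max is a hypothetical choice for speech.
    """
    frame = frame - np.mean(frame)  # remove DC offset
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag


# Synthetic example: two 40 ms tones at 200 Hz, one at half the amplitude.
sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
loud_frame = 0.5 * np.sin(2 * np.pi * 200.0 * t)
quiet_frame = 0.25 * np.sin(2 * np.pi * 200.0 * t)

f0 = estimate_f0_autocorr(loud_frame, sample_rate)
loud_db = rms_level_db(loud_frame)
quiet_db = rms_level_db(quiet_frame)
```

In this toy example, halving the amplitude lowers the RMS level by about 6 dB while the estimated F0 stays the same, which shows that the two features vary independently.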
Abstract
Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) to detect depression using long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician interpretation. Our experiments show the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe our model identifies reduced loudness and F0 as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach for…
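The contrast the abstract draws between segment-level and speech-level training can be sketched as follows. This is an illustrative assumption about how segment-level labelling is typically set up, not the authors' pipeline: each short segment inherits the speaker's label, so segments that carry no depression cue are still labelled as depressed, which is one plausible reading of "segment-level labelling noise". The helper names and the 10-second segment length are hypothetical.

```python
import numpy as np


def split_into_segments(waveform: np.ndarray, sample_rate: int,
                        segment_seconds: float) -> list:
    """Cut a waveform into non-overlapping fixed-length segments.

    Any trailing remainder shorter than one segment is dropped.
    """
    seg_len = int(segment_seconds * sample_rate)
    n_segments = len(waveform) // seg_len
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def segment_level_examples(waveform: np.ndarray, label: int, sample_rate: int,
                           segment_seconds: float = 10.0) -> list:
    """Every segment inherits the recording-level label (hypothetical setup)."""
    segments = split_into_segments(waveform, sample_rate, segment_seconds)
    return [(segment, label) for segment in segments]


def speech_level_example(waveform: np.ndarray, label: int) -> list:
    """The whole recording is one training example with one label."""
    return [(waveform, label)]


# Synthetic 35-second recording from a speaker labelled as depressed (1).
sample_rate = 16000
recording = np.zeros(35 * sample_rate, dtype=np.float32)

seg_examples = segment_level_examples(recording, label=1, sample_rate=sample_rate)
speech_examples = speech_level_example(recording, label=1)
```

Here the segment-level setup yields three examples that all carry label 1, whether or not each segment contains any depression-relevant speech, while the speech-level setup yields a single example covering the full recording.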
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition
