An interpretable speech foundation model for depression detection by revealing prediction-relevant acoustic features from long speech

Qingkun Deng; Saturnino Luz; Sofia de la Fuente Garcia

arXiv:2406.03138·cs.SD·March 26, 2026

Open Access

TL;DR

This paper introduces an interpretable speech foundation model that detects depression from long-duration speech rather than short segments, paired with a novel interpretation method intended to improve both detection performance and clinical applicability.

Contribution

It presents a speech-level Audio Spectrogram Transformer (AST) and a new interpretation method that reveals prediction-relevant acoustic features, improving depression detection and interpretability over a segment-level AST.

Findings

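01 The speech-level AST outperforms a segment-level AST in depression detection.

02 Using longer speech duration makes detection more reliable, while segment-level labelling introduces noise.

03 The model identifies reduced loudness and F0 (fundamental frequency) as depression-relevant signals, consistent with documented clinical findings.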

Abstract

Speech-based depression detection tools could aid early screening. Here, we propose an interpretable speech foundation model approach to enhance the clinical applicability of such tools. We introduce a speech-level Audio Spectrogram Transformer (AST) to detect depression using long-duration speech instead of short segments, along with a novel interpretation method that reveals prediction-relevant acoustic features for clinician interpretation. Our experiments show the proposed model outperforms a segment-level AST, highlighting the impact of segment-level labelling noise and the advantage of leveraging longer speech duration for more reliable depression detection. Through interpretation, we observe our model identifies reduced loudness and F0 as relevant depression signals, aligning with documented clinical findings. This interpretability supports a responsible AI approach for…

Taxonomy

Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition