Multimodal Depression Classification Using Articulatory Coordination Features And Hierarchical Attention Based Text Embeddings
Nadee Seneviratne, Carol Espy-Wilson

TL;DR
This paper presents a multimodal depression detection system combining articulatory features from speech and hierarchical attention-based text embeddings, demonstrating improved accuracy over unimodal methods, especially with limited data.
Contribution
It introduces a novel multimodal depression classifier integrating articulatory coordination features and hierarchical attention text embeddings, with a multi-stage training approach for limited data scenarios.
Findings
AUC improvements of 7.5% and 13.7% over the audio-only and text-only classifiers, respectively
Effective session-wise prediction with limited training data
Enhanced depression detection accuracy through multimodal integration
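The AUC figures above compare ranking quality between classifiers. As a reminder of what the metric measures, here is a minimal sketch that computes AUC as the fraction of (positive, negative) pairs in which the positive session receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not values or code from the paper.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: P(score of a positive > score of a negative).

    labels: iterable of 0/1 ground-truth values (1 = depressed, by assumption).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The quadratic pairwise loop is chosen for readability; for large evaluation sets a rank-based computation would be the usual choice.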
Abstract
Multimodal depression classification has gained considerable popularity in recent years. We develop a multimodal depression classification system using articulatory coordination features extracted from vocal tract variables and text transcriptions obtained from an automatic speech recognition tool. The system improves the area under the receiver operating characteristic curve (AUC) over the uni-modal classifiers (by 7.5% and 13.7% for audio and text, respectively). We show that when training data is limited, a segment-level classifier can be trained first and then used to obtain a session-wise prediction without hindering performance, using a multi-stage convolutional recurrent neural network. A text model is trained using a Hierarchical Attention Network (HAN). The multimodal system is developed by combining embeddings from the session-level audio model and the HAN text model.
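The abstract says the multimodal system combines embeddings from the session-level audio model and the HAN text model, but it does not say how. The sketch below assumes the simplest option, concatenating the two embeddings and passing them through a single logistic output unit. The function names, embedding sizes, weights, and bias are all hypothetical and are not taken from the paper.

```python
import math


def fuse_embeddings(audio_emb, text_emb):
    """Concatenate the two per-session embeddings (assumed fusion strategy)."""
    return list(audio_emb) + list(text_emb)


def predict_depression_prob(audio_emb, text_emb, weights, bias):
    """Score one session with a logistic unit on top of the fused embedding.

    weights and bias stand in for parameters that would be learned in
    training; here they are placeholders supplied by the caller.
    """
    fused = fuse_embeddings(audio_emb, text_emb)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the logit to (0, 1)


# Hypothetical 2-dimensional audio embedding and 1-dimensional text embedding.
prob = predict_depression_prob([0.2, -0.1], [0.5], [1.0, 1.0, 1.0], 0.0)
```

In practice the fused vector would more likely feed one or more trained dense layers; a single logistic unit keeps the example short while still showing where the two modalities meet.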
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing
