Speech-Based Estimation of Schizophrenia Severity Using Feature Fusion
Gowtham Premananth, Carol Espy-Wilson

TL;DR
This paper presents a deep learning framework that fuses articulatory and self-supervised speech features to estimate schizophrenia severity from speech, reducing MAE by 9.18% and RMSE by 9.36% relative to previous models that combined speech and video inputs.
Contribution
It introduces a novel feature fusion approach combining articulatory and self-supervised speech features, along with an auto-encoder-based representation learning framework for improved severity estimation.
Findings
Reduced MAE by 9.18% compared with previous models that combined speech and video inputs
Reduced RMSE by 9.36% compared with the same previous models
Effective fusion of articulatory and self-supervised features enhances accuracy
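For readers who want to see how such percentages are typically computed, here is a minimal sketch of MAE, RMSE, and a relative error reduction. The function names and the severity scores are made up for illustration and are not taken from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_reduction(baseline_error, new_error):
    """Percent by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Hypothetical severity scores and predictions (NOT data from the paper).
y_true = [30.0, 42.0, 55.0, 38.0]
baseline_pred = [36.0, 35.0, 49.0, 45.0]   # stand-in for an earlier model
fused_pred = [35.0, 36.0, 50.0, 44.0]      # stand-in for a fusion model

mae_gain = relative_reduction(mae(y_true, baseline_pred), mae(y_true, fused_pred))
rmse_gain = relative_reduction(rmse(y_true, baseline_pred), rmse(y_true, fused_pred))
print(f"MAE reduced by {mae_gain:.2f}%, RMSE reduced by {rmse_gain:.2f}%")
```

A reported "9.18% MAE reduction" would correspond to `relative_reduction(baseline_mae, new_mae)` returning 9.18 under this definition, assuming the paper uses the same relative measure.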
Abstract
Speech-based assessment of the schizophrenia spectrum has been widely researched in the recent past. In this study, we develop a deep learning framework to estimate schizophrenia severity scores from speech using a feature fusion approach that fuses articulatory features with different self-supervised speech features extracted from pre-trained audio models. We also propose an auto-encoder-based self-supervised representation learning framework to extract compact articulatory embeddings from speech. Our top-performing speech-based fusion model with Multi-Head Attention (MHA) reduces Mean Absolute Error (MAE) by 9.18% and Root Mean Squared Error (RMSE) by 9.36% for schizophrenia severity estimation when compared with the previous models that combined speech and video inputs.
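The abstract does not spell out how the two feature streams are fused, so the following is only a rough sketch of one way multi-head attention could combine them. Everything here is an assumption: the dimensions, the choice of articulatory embeddings as queries and self-supervised features as keys and values, the mean pooling, and the untrained random weights. It uses NumPy rather than a deep learning library to keep the mechanics visible.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, key_value, params, num_heads):
    """Cross-attention: query (Tq, d) attends over key_value (Tk, d)."""
    Wq, Wk, Wv, Wo = params
    Tq, d = query.shape
    Tk = key_value.shape[0]
    head_dim = d // num_heads
    # Project, then split the model dimension into heads: (heads, T, head_dim).
    q = (query @ Wq).reshape(Tq, num_heads, head_dim).transpose(1, 0, 2)
    k = (key_value @ Wk).reshape(Tk, num_heads, head_dim).transpose(1, 0, 2)
    v = (key_value @ Wv).reshape(Tk, num_heads, head_dim).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, Tq, Tk)
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    heads = weights @ v                                    # (heads, Tq, head_dim)
    concat = heads.transpose(1, 0, 2).reshape(Tq, d)       # merge heads back
    return concat @ Wo, weights

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4  # hypothetical sizes
params = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
w_out = rng.standard_normal(d_model) / np.sqrt(d_model)  # untrained linear head

# Random stand-ins for the two feature streams (NOT real features).
articulatory = rng.standard_normal((20, d_model))   # assumed: 20 frames
ssl_features = rng.standard_normal((50, d_model))   # assumed: 50 frames

fused, weights = multi_head_attention(articulatory, ssl_features, params, num_heads)
severity = float(fused.mean(axis=0) @ w_out)  # meaningless until weights are trained
```

In a trained model the projection matrices and the output head would be learned from labeled sessions; here they only demonstrate the shapes and the data flow.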
Taxonomy
Topics: Emotion and Mood Recognition
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention
