Self-supervised Multimodal Speech Representations for the Assessment of Schizophrenia Symptoms
Gowtham Premananth, Carol Espy-Wilson

TL;DR
This paper presents a self-supervised, VQ-VAE-based multimodal speech representation system for schizophrenia assessment. It classifies prominent symptom classes and predicts an overall severity score from vocal tract variables and facial action units.
Contribution
It introduces a VQ-VAE-based multimodal representation learning framework for schizophrenia assessment that outperforms previous work on multi-class symptom classification and adds prediction of an overall severity score.
Findings
Outperforms previous work on the multi-class classification task across Weighted F1 score, AUC-ROC score, and Weighted Accuracy
Estimates an overall schizophrenia severity score
Produces task-agnostic multimodal speech representations for clinical assessment
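As a rough illustration of one of the reported metrics, the sketch below computes a support-weighted F1 score for a multi-class problem in plain Python. The integer class labels and the example predictions are hypothetical and do not come from the paper; the function name `weighted_f1` is ours, not the authors'.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Each class's F1 is weighted by the fraction of true samples in that class.
    """
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for c in sorted(support):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # define F1 as 0 when undefined
        score += (support[c] / total) * f1
    return score


# Hypothetical labels for a three-class task (not data from the paper).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by class support keeps a frequent class from being under-represented in the average, which matters when the classes are imbalanced.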
Abstract
Multimodal schizophrenia assessment systems have gained traction over the last few years. This work introduces a schizophrenia assessment system to discern between prominent symptom classes of schizophrenia and predict an overall schizophrenia severity score. We develop a Vector Quantized Variational Auto-Encoder (VQ-VAE) based Multimodal Representation Learning (MRL) model to produce task-agnostic speech representations from vocal Tract Variables (TVs) and Facial Action Units (FAUs). These representations are then used in a Multi-Task Learning (MTL) based downstream prediction model to obtain class labels and an overall severity score. The proposed framework outperforms the previous works on the multi-class classification task across all evaluation metrics (Weighted F1 score, AUC-ROC score, and Weighted Accuracy). Additionally, it estimates the schizophrenia severity score, a task not…
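The core operation of a VQ-VAE is the quantization step, which replaces each continuous encoder output with its nearest entry in a learned codebook. The NumPy sketch below shows that step in isolation. It is a minimal illustration of generic vector quantization, not the authors' architecture: the function names, the codebook size, the latent dimension, and the commitment weight `beta` are all assumptions made for the example.

```python
import numpy as np


def vector_quantize(z, codebook):
    """Map each latent vector in z (N, D) to its nearest codebook entry (K, D)."""
    # Squared Euclidean distance between every latent and every code: shape (N, K).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)     # index of the nearest code per latent
    z_q = codebook[indices]            # quantized latents, shape (N, D)
    return z_q, indices


def vq_loss_terms(z, z_q, beta=0.25):
    """Values of the codebook and commitment terms of the standard VQ-VAE loss.

    In a real training loop the two terms differ by where the gradient is
    stopped; NumPy has no autograd, so only the numeric values are shown here.
    """
    codebook_term = np.mean((z_q - z) ** 2)
    commitment_term = beta * np.mean((z - z_q) ** 2)
    return codebook_term, commitment_term


# Hypothetical 2-D latents and a 3-entry codebook (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z = np.array([[0.1, -0.2], [0.9, 1.2], [3.5, 4.1]])
z_q, indices = vector_quantize(z, codebook)
print(indices)
```

In a multimodal setting, the latents `z` would come from encoders over the input streams (here, TVs and FAUs), and the discrete indices or quantized vectors would feed the downstream prediction model.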
Taxonomy
Topics: Emotion and Mood Recognition · Voice and Speech Disorders · Phonetics and Phonology Research
