A multi-modal approach for identifying schizophrenia using cross-modal attention
Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

TL;DR
This paper presents a multi-modal system combining audio, video, and text data with cross-modal attention to improve schizophrenia detection, outperforming previous methods by 8.53% in F1 score.
Contribution
It introduces a novel multi-modal classification framework using cross-modal attention and hierarchical models for schizophrenia identification.
Findings
Outperforms previous state-of-the-art by 8.53% in F1 score.
Effectively combines audio, video, and text modalities.
Uses high-level coordination features and hierarchical attention for improved accuracy.
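The findings mention high-level coordination features but this summary does not say how they are computed. One plausible reading, offered here only as an assumption, is that coordination is measured by correlating the low-level feature channels (for example facial action units or vocal tract variables) with each other at several time delays. The sketch below illustrates that idea in NumPy; the function name `delayed_correlation`, the choice of delays, and the random input are all hypothetical and are not taken from the paper.

```python
import numpy as np

def delayed_correlation(features, delays):
    """Cross-channel correlation at several time delays (illustrative only).

    features: array of shape (channels, frames), one row per low-level feature.
    delays:   iterable of non-negative frame offsets.
    Returns an array of shape (len(delays), channels, channels).
    """
    n_ch, n_frames = features.shape
    # Z-score every channel over the full segment (population std, ddof=0).
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    z = (features - mean) / std

    out = np.empty((len(delays), n_ch, n_ch))
    for i, d in enumerate(delays):
        lead = z[:, : n_frames - d]   # channel values at time t
        lag = z[:, d:]                # channel values at time t + d
        # Average product over the overlapping frames. For d = 0 this is the
        # exact Pearson correlation matrix; for d > 0 it is an approximation
        # because the normalization uses full-segment statistics.
        out[i] = lead @ lag.T / (n_frames - d)
    return out

# Hypothetical input: 6 feature channels observed over 200 frames.
rng = np.random.default_rng(0)
low_level = rng.normal(size=(6, 200))
coordination = delayed_correlation(low_level, delays=[0, 1, 5])
```

The stacked matrices could then serve as a segment-level input to a classifier, which is the role the summary assigns to the coordination features; the actual feature construction in the paper may differ.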
Abstract
This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectively, which were then used to compute high-level coordination features that served as the inputs to the audio and video modalities. Context-independent text embeddings extracted from transcriptions of speech were used as the input for the text modality. The multi-modal system is developed by fusing a segment-to-session-level classifier for video and audio modalities with a text model based on a Hierarchical Attention Network (HAN) with cross-modal attention. The proposed multi-modal system…
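The abstract names cross-modal attention as the fusion mechanism but, being truncated, gives no detail on its form. As a minimal sketch, assuming the standard scaled dot-product formulation in which one modality supplies the queries and another supplies the keys and values, the idea can be illustrated as follows. The dimensions, the random projection matrices, and the choice of text as the query modality are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_seq, context_seq, w_q, w_k, w_v):
    """Let one modality (queries) attend over another (keys and values).

    query_seq:   (T_q, d_query)   e.g. text embeddings
    context_seq: (T_c, d_context) e.g. audio segment features
    Returns the attended sequence (T_q, d) and the weights (T_q, T_c).
    """
    q = query_seq @ w_q       # (T_q, d)
    k = context_seq @ w_k     # (T_c, d)
    v = context_seq @ w_v     # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot products
    weights = softmax(scores, axis=-1)        # each query row sums to 1
    return weights @ v, weights

# Hypothetical shapes: 5 text steps (dim 16) attending over 8 audio steps (dim 12).
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 16))
audio = rng.normal(size=(8, 12))
d = 4
w_q = rng.normal(size=(16, d))
w_k = rng.normal(size=(12, d))
w_v = rng.normal(size=(12, d))

attended, weights = cross_modal_attention(text, audio, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one text step draws on each audio step. In a trained system the projection matrices would be learned rather than sampled at random.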
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis
