TL;DR
This paper introduces a multi-level attention network that fuses text, audio, and video data to improve depression prediction accuracy, addressing challenges like data variability and lack of labeled datasets.
Contribution
It proposes a novel multi-modal, multi-level attention architecture for depression prediction, enhancing feature relevance learning across modalities.
Findings
Outperforms the baseline by 17.52% in root mean squared error (RMSE)
Develops multiple regression models for each modality
Analyzes impact of different feature fusion configurations
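To make the RMSE finding concrete, here is a minimal sketch of how a percentage improvement over a baseline can be computed. It assumes the reported figure is a relative reduction in RMSE, which the summary does not state explicitly. The function names `rmse` and `relative_improvement` are hypothetical, and the numbers are toy values, not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, model_rmse):
    """Percentage by which model_rmse is lower than baseline_rmse.

    Assumption: "outperforms by X%" means a relative RMSE reduction.
    """
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Toy depression-severity scores, for illustration only.
toy_truth = [10.0, 4.0, 15.0, 7.0]
toy_baseline_pred = [14.0, 8.0, 11.0, 3.0]
toy_model_pred = [12.0, 6.0, 13.0, 5.0]

baseline_error = rmse(toy_truth, toy_baseline_pred)
model_error = rmse(toy_truth, toy_model_pred)
improvement_pct = relative_improvement(baseline_error, model_error)
```

Because lower RMSE is better, a positive `improvement_pct` means the model beats the baseline.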
Abstract
Depression has been the leading cause of mental illness worldwide. Major depressive disorder (MDD) is a common mental health disorder that affects people both psychologically and physically and can lead to loss of life. Because diagnostic tests are lacking and detecting depression involves subjectivity, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge range of possible variation in behaviour make the problem more challenging. This paper presents a novel multi-level attention-based network for multi-modal depression prediction that fuses features from the audio, video and text modalities while learning intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features…
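The idea of weighting modalities by learned relevance can be illustrated with a minimal sketch of softmax attention over per-modality feature vectors. This is not the paper's architecture: the scoring vector, the feature values, and the names `softmax` and `attention_fuse` are all hypothetical, and a real model would learn the scoring parameters and apply attention at several levels rather than once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, scoring_weights):
    """Fuse same-length modality vectors using softmax attention weights.

    features: dict mapping modality name -> feature vector (list of floats).
    scoring_weights: hypothetical scoring vector (learned in a real model).
    """
    names = list(features.keys())
    # One scalar relevance score per modality: dot product with the scoring vector.
    scores = [
        sum(w * x for w, x in zip(scoring_weights, features[name]))
        for name in names
    ]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(weight * features[name][i] for weight, name in zip(weights, names))
        for i in range(len(scoring_weights))
    ]
    return dict(zip(names, weights)), fused


# Toy feature vectors, for illustration only.
features = {
    "text": [0.2, 0.9, 0.1],
    "audio": [0.5, 0.1, 0.3],
    "video": [0.4, 0.4, 0.4],
}
scoring_weights = [1.0, 0.5, -0.5]
modality_weights, fused = attention_fuse(features, scoring_weights)
```

The attention weights sum to one, so the fused vector is a convex combination of the modality vectors. Modalities with higher relevance scores contribute more to the fused representation.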
