Dynamic Fusion Multimodal Network for SpeechWellness Detection
Wenqiang Sun, Han Yin, Jisheng Bai, Jianfeng Chen

TL;DR
This paper presents a lightweight multimodal system with dynamic fusion for speechwellness detection, integrating acoustic and semantic features to improve accuracy while reducing model complexity.
Contribution
It introduces a dynamic fusion mechanism and combines time-domain, time-frequency, and semantic features in a lightweight model for better speechwellness detection.
Findings
Achieved a 78% reduction in model parameters compared with the baseline.
Improved detection accuracy by 5% over the baseline.
Outperformed baseline models in experiments.
Abstract
Suicide is one of the leading causes of death among adolescents. Previous suicide risk prediction studies have primarily focused on either textual or acoustic information in isolation, whereas the integration of multimodal signals, such as speech and text, offers a more comprehensive understanding of an individual's mental state. Motivated by this, and in the context of the 1st SpeechWellness detection challenge, we explore a lightweight multi-branch multimodal system based on a dynamic fusion mechanism for speechwellness detection. To address the limitation of prior approaches that rely on time-domain waveforms for acoustic analysis, our system incorporates both time-domain and time-frequency (TF) domain acoustic features, as well as semantic representations. In addition, we introduce a dynamic fusion block to adaptively integrate information from different modalities. Specifically, it applies…
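The abstract is cut off before it says how the dynamic fusion block works, so the following is only an illustrative sketch of one common way to fuse branches adaptively, not the authors' method. It assumes each of the three branches (time-domain, TF-domain, semantic) has already produced an embedding of the same length, and that a hypothetical scoring vector `scoring_weights` (learned in a real system) turns each embedding into a scalar score. A softmax over the scores gives input-dependent weights, and the fused vector is the weighted sum. The names `dynamic_fusion`, `softmax`, and `scoring_weights` are placeholders introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(branch_embeddings, scoring_weights):
    """Fuse equal-length branch embeddings with input-dependent weights.

    ASSUMPTION: softmax-weighted summation is a stand-in; the source
    does not state which operation the fusion block applies.
    """
    dim = len(scoring_weights)
    if any(len(emb) != dim for emb in branch_embeddings):
        raise ValueError("every embedding must match len(scoring_weights)")

    # One scalar score per branch: dot product with the scoring vector.
    scores = [
        sum(w * x for w, x in zip(scoring_weights, emb))
        for emb in branch_embeddings
    ]
    weights = softmax(scores)

    # Weighted sum of the branch embeddings, element by element.
    fused = [
        sum(wt * emb[i] for wt, emb in zip(weights, branch_embeddings))
        for i in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Toy embeddings standing in for the three branches.
    time_domain = [0.2, 0.9, -0.1]
    time_frequency = [0.5, 0.1, 0.4]
    semantic = [-0.3, 0.6, 0.8]
    fused, weights = dynamic_fusion(
        [time_domain, time_frequency, semantic],
        scoring_weights=[1.0, 0.5, -0.5],  # hypothetical values
    )
    print("branch weights:", weights)
    print("fused embedding:", fused)
```

Because the weights are recomputed for every input, a branch that scores higher on a given sample contributes more to the fused vector for that sample. With an all-zero scoring vector the sketch reduces to a plain average of the branches.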
Taxonomy
Topics: Mental Health via Writing · Emotion and Mood Recognition · Speech Recognition and Synthesis
