Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network

Jinjin Cai; Ruiqi Wang; Dezhong Zhao; Ziqin Yuan; Victoria McKenna; Aaron Friedman; Rachel Foot; Susan Storey; Ryan Boente; Sudip Vhaduri; Byung-Cheol Min

arXiv:2410.09289·cs.SD·December 17, 2024

Open Access

TL;DR

This paper introduces a transformer-based hierarchical fusion network that effectively combines intra-modal and inter-modal features for multimodal audio-based disease prediction, achieving state-of-the-art results across multiple diseases.

Contribution

It presents a novel hierarchical fusion approach that integrates intra-modal and inter-modal features within a transformer framework for improved disease prediction.

Findings

01 Achieves state-of-the-art performance on COVID-19, Parkinson's, and dysarthria datasets.

02 Demonstrates the effectiveness of hierarchical fusion over unilateral strategies.

03 Shows significant benefits of each model component through ablation studies.

Abstract

Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based…
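The abstract contrasts unilateral fusion (intra-modal only or inter-modal only) with a hierarchical scheme that does both. As a rough illustration of that two-stage idea, and not the authors' architecture, the sketch below first fuses the feature-domain vectors inside each modality, then fuses the resulting per-modality vectors across modalities. Everything in it is an assumption made for illustration: the function names, the modality names "cough" and "speech", the toy numbers, the use of unparameterized scaled dot-product self-attention (no learned projections), and mean pooling.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Assumption: a real transformer layer would use learned query/key/value
    projections; they are omitted here to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def mean_pool(tokens):
    """Average a list of equal-length vectors into one vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def hierarchical_fusion(modalities):
    """Hypothetical two-stage fusion: intra-modal first, then inter-modal."""
    # Stage 1 (intra-modal): attend over feature-domain vectors within each modality.
    modality_vectors = [
        mean_pool(self_attention(domains)) for domains in modalities.values()
    ]
    # Stage 2 (inter-modal): attend over the per-modality summary vectors.
    return mean_pool(self_attention(modality_vectors))


# Toy input: each modality holds several feature-domain embeddings of size 4.
modalities = {
    "cough": [[0.2, 0.1, 0.0, 0.5], [0.4, 0.3, 0.1, 0.2]],
    "speech": [[0.0, 0.6, 0.2, 0.1], [0.1, 0.5, 0.3, 0.0], [0.3, 0.3, 0.3, 0.1]],
}
fused = hierarchical_fusion(modalities)
print(len(fused))
```

A unilateral strategy in this toy setting would correspond to running only one of the two stages; the fused vector would then feed a classifier head, which is left out here.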

Taxonomy

Topics: Music and Audio Processing

Methods: Focus