Multimodal Audio-based Disease Prediction with Transformer-based Hierarchical Fusion Network
Jinjin Cai, Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Victoria McKenna, Aaron Friedman, Rachel Foot, Susan Storey, Ryan Boente, Sudip Vhaduri, and Byung-Cheol Min

TL;DR
This paper introduces a transformer-based hierarchical fusion network that combines intra-modal and inter-modal features for multimodal audio-based disease prediction, and reports state-of-the-art results on COVID-19, Parkinson's disease, and dysarthria datasets.
Contribution
The paper presents a hierarchical fusion approach that integrates intra-modal and inter-modal features within a single transformer framework to improve disease prediction.
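To make the two fusion levels concrete, here is a minimal sketch in plain Python. It is not the authors' architecture, which this summary does not specify. It assumes two hypothetical modalities (`modality_a`, `modality_b`), each described by two feature domains of token vectors, and it uses scaled dot-product attention without learned projections as a stand-in for transformer layers. All function names are illustrative.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context):
    """Scaled dot-product attention with no learned weights (a simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[i] for w, k in zip(weights, context)) for i in range(d)])
    return out


def intra_modal_fusion(domains):
    """Level 1: self-attention over all feature-domain tokens of ONE modality."""
    tokens = [tok for domain in domains for tok in domain]
    return attention(tokens, tokens)


def mean_pool(tokens):
    """Average token vectors into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def inter_modal_fusion(a, b):
    """Level 2: cross-attention in both directions, then pool and concatenate."""
    return mean_pool(attention(a, b)) + mean_pool(attention(b, a))


def random_tokens(rng, n, d):
    """Synthetic stand-in for extracted acoustic features (not real data)."""
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
# Hypothetical input: two modalities, each with two feature domains, dimension 8.
modality_a = [random_tokens(rng, 4, 8), random_tokens(rng, 6, 8)]
modality_b = [random_tokens(rng, 5, 8), random_tokens(rng, 3, 8)]

repr_a = intra_modal_fusion(modality_a)     # 10 tokens of dimension 8
repr_b = intra_modal_fusion(modality_b)     # 8 tokens of dimension 8
fused = inter_modal_fusion(repr_a, repr_b)  # one 16-dimensional vector

print(len(repr_a), len(repr_b), len(fused))  # prints: 10 8 16
```

The ordering is the point of the sketch: feature domains are fused within each modality first, and only the resulting modality-level representations attend to each other. In a trained model the fused vector would feed a classifier head, and each attention step would carry learned query, key, and value projections.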
Findings
Achieves state-of-the-art performance on COVID-19, Parkinson's, and dysarthria datasets.
Demonstrates that hierarchical fusion outperforms unilateral strategies that use only intra-modal or only inter-modal fusion.
Shows significant benefits of each model component through ablation studies.
Abstract
Audio-based disease prediction is emerging as a promising supplement to traditional medical diagnosis methods, facilitating early, convenient, and non-invasive disease detection and prevention. Multimodal fusion, which integrates features from various domains within or across bio-acoustic modalities, has proven effective in enhancing diagnostic performance. However, most existing methods in the field employ unilateral fusion strategies that focus solely on either intra-modal or inter-modal fusion. This approach limits the full exploitation of the complementary nature of diverse acoustic feature domains and bio-acoustic modalities. Additionally, the inadequate and isolated exploration of latent dependencies within modality-specific and modality-shared spaces curtails their capacity to manage the inherent heterogeneity in multimodal data. To fill these gaps, we propose a transformer-based…
Taxonomy
Topics: Music and Audio Processing
Methods: Focus
