Resilient Vision-Tabular Multimodal Learning under Modality Missingness
Camillo Maria Caruso, Valerio Guarrasi, Paolo Soda

TL;DR
This paper introduces a multimodal transformer framework for medical data analysis that remains effective despite missing modalities, outperforming baselines across various missingness scenarios.
Contribution
The work presents a novel multimodal transformer architecture with modality-aware fusion and dropout regularization, enhancing robustness to missing data in clinical applications.
Findings
Consistently outperforms baselines under increasing modality missingness.
Performance degrades more smoothly than that of existing methods as missingness increases.
Ablation studies highlight the importance of attention masking and joint fine-tuning.
Abstract
Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings, where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision encoder, a tabular encoder, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities…
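The idea of masked self-attention that excludes missing tokens can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and a boolean availability vector, and the names `masked_self_attention`, `tokens`, and `present` are hypothetical.

```python
import numpy as np


def masked_self_attention(tokens, present):
    """Single-head self-attention that ignores missing tokens.

    tokens:  (n, d) float array of token embeddings (image patches,
             tabular features, ...); rows for missing tokens may hold
             any placeholder value, including NaN.
    present: (n,) boolean array, True where the token is available.

    Assumption for brevity: queries, keys, and values are the tokens
    themselves (no learned projections).
    """
    present = np.asarray(present, dtype=bool)
    if not present.any():
        raise ValueError("at least one token must be present")

    n, d = tokens.shape
    # Neutralize placeholder values so they cannot leak NaN into the output.
    x = np.where(present[:, None], tokens, 0.0)

    scores = x @ x.T / np.sqrt(d)                      # (n, n) similarities
    # Missing tokens are removed as keys: their columns get -inf,
    # which becomes exactly zero weight after the softmax.
    scores = np.where(present[None, :], scores, -np.inf)

    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    out = weights @ x
    # Missing tokens are also removed as queries: their outputs are zeroed.
    out = np.where(present[:, None], out, 0.0)
    return out, weights
```

Because masked columns receive zero attention weight, the output for every available token is unaffected by whatever placeholder sits in a missing slot, which is why no imputation is needed in this sketch.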
