Resilient Vision-Tabular Multimodal Learning under Modality Missingness

Camillo Maria Caruso; Valerio Guarrasi; Paolo Soda

arXiv:2605.12031·cs.LG·May 13, 2026

TL;DR

This paper introduces a multimodal transformer framework for medical data analysis that remains effective despite missing modalities, outperforming baselines across various missingness scenarios.

Contribution

The work presents a novel multimodal transformer architecture with modality-aware fusion and dropout regularization, enhancing robustness to missing data in clinical applications.

Findings

01. Consistently outperforms baselines under increasing modality missingness.

02. Demonstrates smoother performance degradation compared to existing methods.

03. Ablation studies highlight the importance of attention masking and joint fine-tuning.

Abstract

Multimodal deep learning has shown strong potential in medical applications by integrating heterogeneous data sources such as medical images and structured clinical variables. However, most existing approaches implicitly assume complete modality availability, an assumption that rarely holds in real-world clinical settings where entire modalities and individual features are frequently missing. In this work, we propose a multimodal transformer framework for joint vision-tabular learning explicitly designed to operate under pervasive modality missingness, without relying on imputation or heuristic model switching. The architecture integrates three components: a vision encoder, a tabular encoder, and a multimodal fusion encoder. Unimodal representations are weighted through learnable modality tokens and fused via intermediate fusion with masked self-attention, which excludes missing tokens and modalities…
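
To make the idea of masked self-attention over partially missing inputs concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `masked_self_attention`, the single attention head, and the use of identity query/key/value projections are all assumptions chosen to keep the example short. It only illustrates the general mechanism of excluding missing tokens from the attention computation, so that no imputed value is needed.

```python
import math


def masked_self_attention(tokens, present):
    """Single-head self-attention that ignores missing tokens.

    Hypothetical helper for illustration only. Identity projections are
    used in place of learned query/key/value matrices (an assumption).

    tokens:  list of n vectors, each a list of d floats
    present: list of n booleans; False marks a missing token
    Returns n output vectors; rows for missing tokens are all zeros.
    """
    n, d = len(tokens), len(tokens[0])
    if not any(present):
        raise ValueError("at least one token must be present")

    outputs = []
    for i in range(n):
        if not present[i]:
            # A missing token produces no query; emit a zero row.
            outputs.append([0.0] * d)
            continue

        # Scaled dot-product scores; missing keys get -inf so that
        # their softmax weight is exactly zero.
        scores = [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(d)
            if present[j] else -math.inf
            for j in range(n)
        ]
        top = max(scores)  # finite, because token i attends to itself
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]

        # Weighted sum over present tokens only, so placeholder values
        # stored in missing slots never enter the result.
        outputs.append([
            sum(weights[j] * tokens[j][k] for j in range(n) if present[j])
            for k in range(d)
        ])
    return outputs


# Two available tokens and one missing token filled with a NaN placeholder.
tokens = [[1.0, 0.0], [0.0, 1.0], [float("nan"), float("nan")]]
present = [True, True, False]
fused = masked_self_attention(tokens, present)
```

In this sketch the outputs for the available tokens are the same as if the missing token had never been in the sequence, which is the property that makes masking an alternative to imputation.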
