LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi, Sercan O. Arik, Yihe Dong, Tomas Pfister

TL;DR
LANISTR is a novel attention-based framework that effectively learns from multimodal data including language, images, and structured data, demonstrating significant improvements in real-world tasks with missing modalities.
Contribution
It introduces a masking-based training method and a similarity-based loss for cross-modal learning from large-scale multimodal data with missing modalities.
Findings
Improves AUROC by 6.6% on MIMIC-IV (healthcare)
Improves accuracy by 14% on Amazon Product Review (retail)
Robust to high ratios of missing modality samples
Abstract
Multimodal large-scale pretraining has shown impressive performance for unstructured data such as language and image. However, a prevalent real-world scenario involves structured data types, tabular and time-series, along with unstructured data. Such scenarios have been understudied. To bridge this gap, we propose LANISTR, an attention-based framework to learn from LANguage, Image, and STRuctured data. The core of LANISTR's methodology is rooted in masking-based training applied across both unimodal and multimodal levels. In particular, we introduce a new similarity-based multimodal masking loss that enables it to learn cross-modal relations from large-scale multimodal data with missing modalities. On two real-world datasets, MIMIC-IV (from healthcare) and Amazon Product Review (from retail), LANISTR demonstrates remarkable improvements, 6.6% (in AUROC) and 14% (in accuracy)…
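The abstract names a similarity-based multimodal masking loss but does not give its form. The following is only a minimal sketch of the general idea, under assumptions that are not from the paper: masking zeroes random feature positions, fusion is a plain average over the modalities that are present (a stand-in for the attention-based fusion the paper describes), and the loss is one minus the cosine similarity between the fused embeddings of the masked and unmasked inputs. All function names here are hypothetical.

```python
import math
import random


def mask_features(features, mask_ratio, rng):
    """Zero out a random subset of positions (illustrative masking, an assumption)."""
    n_mask = int(round(mask_ratio * len(features)))
    masked_idx = set(rng.sample(range(len(features)), n_mask))
    return [0.0 if i in masked_idx else x for i, x in enumerate(features)]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fuse(modalities):
    """Toy fusion: average the embeddings of the modalities that are present.

    A missing modality is represented by None and is simply skipped.
    """
    present = [emb for emb in modalities.values() if emb is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    dim = len(present[0])
    return [sum(emb[i] for emb in present) / len(present) for i in range(dim)]


def similarity_masking_loss(modalities, mask_ratio, rng):
    """1 - cos(fused(masked input), fused(unmasked input)); hypothetical loss form."""
    masked = {
        name: None if emb is None else mask_features(emb, mask_ratio, rng)
        for name, emb in modalities.items()
    }
    return 1.0 - cosine_similarity(fuse(masked), fuse(modalities))


# Usage: a sample whose tabular modality is missing.
sample = {
    "text": [0.2, 0.9, -0.4, 0.1],
    "image": [0.5, -0.3, 0.8, 0.7],
    "tabular": None,
}
loss = similarity_masking_loss(sample, mask_ratio=0.5, rng=random.Random(0))
```

Because the fusion step skips absent modalities, the same loss can be computed for samples with any subset of modalities, which is the property the abstract highlights. In a real model the embeddings would come from learned encoders and the loss would be minimized by gradient descent.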
