MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, Roi Reichart

TL;DR
MulTaBench introduces a comprehensive benchmark of 40 datasets for multimodal tabular learning with text and images, emphasizing task-specific tuning of embeddings to improve predictive performance.
Contribution
It presents the largest image-tabular benchmark to date, highlighting the importance of target-aware representations and providing a platform for developing multimodal tabular foundation models.
Findings
Target-aware representation tuning improves performance across modalities.
Benchmark spans healthcare and e-commerce domains.
Task-specific tuning reduces variance and enhances generalization.
Abstract
Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where…
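The contrast between frozen and task-tuned embeddings can be illustrated with a toy sketch. This is not the authors' method: the data is synthetic, and a ridge-regression projection stands in for whatever tuning procedure the paper uses. All names (`pca_feature`, `target_aware_feature`, the synthetic `E`, `X_tab`, `y`) are hypothetical. The point is only that a target-agnostic reduction of an embedding can discard a low-variance direction that carries the predictive signal, while a target-aware one keeps it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_tab, d_emb = 400, 3, 16

# Synthetic stand-ins (assumptions): tabular columns plus a "frozen"
# embedding of a text or image column.
X_tab = rng.normal(size=(n, d_tab))
scales = np.ones(d_emb)
scales[:4] = 5.0  # high-variance embedding directions that carry no signal
E = rng.normal(size=(n, d_emb)) * scales

# The target depends on one low-variance embedding direction and one tabular column.
y = E[:, 10] + 0.5 * X_tab[:, 0] + 0.1 * rng.normal(size=n)

train, test = slice(0, 200), slice(200, n)


def pca_feature(E_train, E_all):
    """Target-agnostic: project onto the top principal component of the embedding."""
    mean = E_train.mean(axis=0)
    _, _, vt = np.linalg.svd(E_train - mean, full_matrices=False)
    return (E_all - mean) @ vt[0]


def target_aware_feature(E_train, y_train, E_all, ridge=1.0):
    """Target-aware: project onto a ridge-regression direction fit on training labels."""
    mean = E_train.mean(axis=0)
    A = E_train - mean
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]),
                        A.T @ (y_train - y_train.mean()))
    return (E_all - mean) @ w


def abs_corr(a, b):
    return abs(np.corrcoef(a, b)[0, 1])


frozen_feat = pca_feature(E[train], E)
tuned_feat = target_aware_feature(E[train], y[train], E)

# Compare how much held-out signal each one-dimensional feature retains.
frozen_corr = abs_corr(frozen_feat[test], y[test])
tuned_corr = abs_corr(tuned_feat[test], y[test])
print(f"held-out |corr| with target: frozen={frozen_corr:.2f}, tuned={tuned_corr:.2f}")

# Either feature would then be concatenated with X_tab and passed to a tabular model.
X_frozen = np.column_stack([X_tab, frozen_feat])
X_tuned = np.column_stack([X_tab, tuned_feat])
```

In this construction the signal sits in a direction that PCA ranks low, so the gap is large by design; real datasets will be less clear-cut.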
