Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data
Zhiqiang Tang, Zihan Zhong, Tong He, Gerald Friedland

TL;DR
This paper explores best practices for multimodal AutoML involving image, text, and tabular data, proposing a unified pipeline based on extensive experiments on a new benchmark of 22 diverse datasets.
Contribution
The paper introduces a benchmark of 22 multimodal datasets for AutoML, identifies the strategies that work well across it, and consolidates them into a unified pipeline.
Findings
Effective multimodal fusion strategies identified
Data augmentation methods enhance model robustness
Tabular-to-text conversion improves integration
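To make the tabular-to-text finding concrete, here is a minimal sketch of one common way to serialize a table row into a string that a text encoder can consume. The "column: value" template, the separator, and the function name `row_to_text` are assumptions chosen for illustration; the summary above does not say which serialization format the paper uses.

```python
import math


def row_to_text(row, sep="; "):
    """Serialize one tabular row (a dict of column -> value) into a string.

    Hypothetical helper: the "column: value" template is an assumption,
    not the format taken from the paper.
    """
    parts = []
    for column, value in row.items():
        # Skip missing cells so the text encoder never sees "None" or "nan".
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        parts.append(f"{column}: {value}")
    return sep.join(parts)


example = {"age": 42, "city": "Paris", "income": None}
print(row_to_text(example))  # age: 42; city: Paris
```

The resulting string can be concatenated with a sample's free-text fields, so a single text model handles both inputs.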
Abstract
This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
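The count of "4 combinations of the 3 modalities" can be checked with a short enumeration. This sketch assumes that a multimodal combination means any subset of at least two modalities, which the abstract does not state explicitly; the name `multimodal_combinations` is hypothetical.

```python
from itertools import combinations

MODALITIES = ("image", "text", "tabular")


def multimodal_combinations(modalities=MODALITIES):
    """List every subset containing at least two modalities.

    Assumption: single-modality subsets are excluded because they are
    unimodal rather than multimodal.
    """
    return [
        combo
        for size in range(2, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


for combo in multimodal_combinations():
    print(" + ".join(combo))
```

Under that assumption the enumeration yields three pairs plus the full triple, which matches the four combinations the abstract mentions.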
Taxonomy
Topics: Natural Language Processing Techniques
