Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular Data

Zhiqiang Tang; Zihan Zhong; Tong He; Gerald Friedland

arXiv:2412.16243·cs.LG·December 24, 2024

TL;DR

This paper explores best practices for multimodal AutoML involving image, text, and tabular data, proposing a unified pipeline based on extensive experiments on a new benchmark of 22 diverse datasets.

Contribution

The paper introduces a benchmark for multimodal AutoML, identifies effective strategies through experiments on it, and consolidates those strategies into a unified pipeline.

Findings

01 Effective multimodal fusion strategies identified

02 Data augmentation methods enhance model robustness

03 Tabular-to-text conversion improves integration

Abstract

This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
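
One of the design choices listed in the abstract is converting tabular data into text. The abstract does not say how the paper performs this conversion, so the following is only a minimal sketch of one common approach: serializing each row as "column: value" pairs and prepending the result to any free-text field. The function names `row_to_text` and `merge_with_text`, the separators, and the rule of skipping missing cells are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Any, Mapping, Optional


def row_to_text(row: Mapping[str, Any], sep: str = "; ") -> str:
    """Serialize one tabular row as 'column: value' pairs (hypothetical helper).

    Cells that are None or NaN are skipped, which is an assumed policy
    for missing values, not one taken from the paper.
    """
    parts = []
    for column, value in row.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        parts.append(f"{column}: {value}")
    return sep.join(parts)


def merge_with_text(row: Mapping[str, Any], text: Optional[str] = None) -> str:
    """Prepend the serialized row to an optional free-text field (hypothetical helper)."""
    serialized = row_to_text(row)
    if not text:
        return serialized
    if not serialized:
        return text
    return f"{serialized}. {text}"


if __name__ == "__main__":
    example_row = {"age": 3, "breed": "beagle", "weight": float("nan")}
    print(merge_with_text(example_row, "Friendly dog looking for a home"))
```

The merged string can then be passed to whatever text encoder the pipeline already uses, so tabular and text inputs share one input path. Whether this helps on a given dataset is an empirical question of the kind the benchmark is meant to answer.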

Taxonomy

Topics: Natural Language Processing Techniques