LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data
Grigor Bezirganyan, Sana Sellami, Laure Berti-Équille, Sébastien Fournier

TL;DR
LUMA is a comprehensive multimodal dataset designed to facilitate research on uncertainty in multimodal deep learning, enabling controlled experiments and benchmarking for more trustworthy AI systems.
Contribution
It introduces a novel dataset with integrated uncertainty injection tools and baseline models for evaluating uncertainty quantification methods in multimodal learning.
Findings
LUMA enables controlled uncertainty experiments across modalities.
Baseline models demonstrate the dataset's utility for benchmarking.
Tools support diverse data variations and out-of-distribution testing.
Abstract
Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We propose LUMA, a unique multimodal dataset, featuring audio, image, and textual data from 50 classes, specifically designed for learning from uncertain data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with controlling the diversity of…
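The "controlled injection" of uncertainty described in the abstract can be illustrated with a minimal sketch. This is not the LUMA package's API: the function names `flip_labels` and `add_gaussian_noise`, their parameters, and the choice of uniform label flipping and additive Gaussian noise are all assumptions made for illustration, using only the Python standard library.

```python
import random


def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Hypothetical helper: switch a fixed fraction of labels to a different class.

    Exactly round(noise_rate * len(labels)) entries are changed, so the
    amount of label noise is controlled rather than random.
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    for i in rng.sample(range(len(noisy)), n_flip):
        # Pick uniformly among the classes other than the current one.
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy


def add_gaussian_noise(features, sigma, seed=0):
    """Hypothetical helper: add zero-mean Gaussian noise to a flat feature vector."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in features]


# Usage: 200 samples over 50 classes, with 10% of the labels corrupted.
clean = [i % 50 for i in range(200)]
noisy = flip_labels(clean, num_classes=50, noise_rate=0.1)
changed = sum(a != b for a, b in zip(clean, noisy))
```

Fixing the seed and the exact number of flipped labels keeps each noise level reproducible, which is the property a controlled benchmark of this kind needs.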
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
Well-known datasets from vision, text, and audio are combined to create a multimodal dataset that contains the image, text, and spoken-word description of each image. There is a systematic process for adding noise to each modality, and code is made available to add this noise. The dataset for each modality is evaluated separately using baseline classifier models. Gender, cultural, and racial bias in labels is carefully avoided.
It is entirely unclear how useful this dataset is for a practical multimodal learning task. Also, images paired with short captions or spoken-word descriptions at this level are largely bypassed by existing multimodal models. Attaching spoken-word labels (with varying amounts of noise) to low-quality CIFAR images (with varying amounts of noise) does not appear to be very useful.
(1) Careful Data Collection Design: The data collection process for image, audio, and text modalities leverages generative models to augment existing data while incorporating mechanisms to ensure the quality of the benchmark dataset. This aligned, multimodal dataset for classification serves as a valuable resource for a wide range of research applications. (2) Open-Source Python Package: This work includes a robust, open-source Python package, enabling other researchers to easily build upon and
(1) Simplistic Task Design: The classification tasks in this benchmark are relatively simple and may not align with the complexities of recent multimodal research. Although the work emphasizes uncertainty analysis rather than state-of-the-art multimodal models, the current focus of the multimodal research community is on more challenging tasks that better represent real-world applications. As a result, this benchmark may have limited contributions to the fields of trustworthy and robust machine
1. This paper introduces a novel benchmark, LUMA, designed to measure uncertainty across multiple modalities: audio, image, and text. Additionally, the LUMA dataset allows for controlled manipulation of different types and levels of uncertainty. 2. The authors provide baseline pre-trained models along with three uncertainty quantification methods—Monte Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning—offering a solid starting point for benchmarking.
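Monte Carlo Dropout and Deep Ensembles both produce several predictive distributions for the same input (one per stochastic forward pass or per ensemble member). A minimal sketch of how such samples are commonly turned into uncertainty scores follows. The entropy-based split into total, aleatoric, and epistemic parts is a standard technique and is an assumption here, not necessarily what the LUMA baselines compute; the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(prob_samples):
    """Hypothetical helper: summarize T predictive distributions for one input.

    prob_samples is a list of T probability vectors, e.g. softmax outputs
    from T dropout passes or T ensemble members.
    """
    t = len(prob_samples)
    k = len(prob_samples[0])
    # Mean predictive distribution across passes or members.
    mean = [sum(p[j] for p in prob_samples) / t for j in range(k)]
    total = entropy(mean)                                # predictive entropy
    aleatoric = sum(entropy(p) for p in prob_samples) / t  # expected entropy
    return {
        "mean": mean,
        "total": total,
        "aleatoric": aleatoric,
        "epistemic": total - aleatoric,                  # mutual information
    }


# Usage: members that disagree confidently yield high epistemic uncertainty.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
agree = decompose_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])
```

When all samples agree, the epistemic term is close to zero and any remaining uncertainty is attributed to the data. When the samples disagree, the gap between the two entropies grows.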
1. The choice of data sources for specific modalities is unclear. For instance, it’s not evident why CIFAR-10/100 was used for images rather than ImageNet, which includes 1,000 labels and could provide greater diversity for benchmarking. I suggest that the authors clarify this decision in the main paper. 2. To convert text labels into audio, the authors propose a complex approach involving mapping labels to utterance transcriptions and segmenting using forced alignment. However, a simpler alter
Overall, the paper is clearly written. However, there are some limitations.
1. This paper lacks practical applications in real-world scenarios. Both the data and the noise are in a simulated environment, which is very different from real-world applications. At the same time, it collects images from CIFAR, audio using keyword spotting techniques, and text using an LLM to generate short descriptions. However, I cannot imagine that there is multimodal data containing these types of data in real-world scenarios. 2. All sample noise and label noise are simulated, but no real no
Taxonomy
Methods: Dropout
