LUMA: A Benchmark Dataset for Learning from Uncertain and Multimodal Data

Grigor Bezirganyan; Sana Sellami; Laure Berti-Équille; Sébastien Fournier

arXiv:2406.09864·cs.LG·August 14, 2025

TL;DR

LUMA is a comprehensive multimodal dataset designed to facilitate research on uncertainty in multimodal deep learning, enabling controlled experiments and benchmarking for more trustworthy AI systems.

Contribution

It introduces a novel dataset with integrated uncertainty injection tools and baseline models for evaluating uncertainty quantification methods in multimodal learning.

Findings

01. LUMA enables controlled uncertainty experiments across modalities.

02. Baseline models demonstrate the dataset's utility for benchmarking.

03. Tools support diverse data variations and out-of-distribution testing.

Abstract

Multimodal Deep Learning enhances decision-making by integrating diverse information sources, such as texts, images, audio, and videos. To develop trustworthy multimodal approaches, it is essential to understand how uncertainty impacts these models. We propose LUMA, a unique multimodal dataset, featuring audio, image, and textual data from 50 classes, specifically designed for learning from uncertain data. It extends the well-known CIFAR 10/100 dataset with audio samples extracted from three audio corpora, and text data generated using the Gemma-7B Large Language Model (LLM). The LUMA dataset enables the controlled injection of varying types and degrees of uncertainty to achieve and tailor specific experiments and benchmarking initiatives. LUMA is also available as a Python package including the functions for generating multiple variants of the dataset with controlling the diversity of…
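
To make "controlled injection of uncertainty" concrete, the following is a minimal sketch of two common forms of injected noise: flipping a fixed fraction of labels and adding Gaussian noise to sample features. It uses only the Python standard library and is not the LUMA package's API; the function names `flip_labels` and `add_gaussian_noise`, their parameters, and the toy data are assumptions made for illustration.

```python
import random


def flip_labels(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which round(noise_rate * len(labels))
    entries are replaced by a different class drawn uniformly at random.
    Hypothetical helper, not part of the LUMA package."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))
    # Sample distinct positions so exactly n_flip labels change.
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


def add_gaussian_noise(features, std, seed=0):
    """Return a copy of `features` (a list of float lists) with zero-mean
    Gaussian noise of standard deviation `std` added to every value.
    Hypothetical helper, not part of the LUMA package."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, std) for x in row] for row in features]


# Toy data: 1000 samples spread evenly over 50 classes.
labels = [i % 50 for i in range(1000)]
noisy_labels = flip_labels(labels, num_classes=50, noise_rate=0.2)
changed = sum(a != b for a, b in zip(labels, noisy_labels))

# Toy feature vectors perturbed with a chosen noise level.
features = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
noisy_features = add_gaussian_noise(features, std=0.1)
```

Fixing the seed and the noise rate is what makes such an experiment "controlled": the same corrupted variant can be regenerated, and the degree of corruption can be swept while everything else stays constant.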

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 3

Strengths

Well-known datasets from vision, text, and audio are combined to create a multimodal dataset that contains the image, text, and spoken-word description of the image. There is a systematic process to add noise to each modality. Code is made available to add this noise. The dataset for each modality is evaluated separately using baseline classifier models. Gender, cultural, and racial bias in labels is carefully avoided.

Weaknesses

It is totally unclear how useful this dataset is for a practical multimodal learning task. Also, images paired with short captions or spoken-word descriptions at this level are largely bypassed by existing multimodal models. Attaching spoken-word labels (with varying amounts of noise) to low-quality CIFAR images (with varying amounts of noise) does not appear to be very useful.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

(1) Careful Data Collection Design: The data collection process for image, audio, and text modalities leverages generative models to augment existing data while incorporating mechanisms to ensure the quality of the benchmark dataset. This aligned, multimodal dataset for classification serves as a valuable resource for a wide range of research applications. (2) Open-Source Python Package: This work includes a robust, open-source Python package, enabling other researchers to easily build upon and

Weaknesses

(1) Simplistic Task Design: The classification tasks in this benchmark are relatively simple and may not align with the complexities of recent multimodal research. Although the work emphasizes uncertainty analysis rather than state-of-the-art multimodal models, the current focus of the multimodal research community is on more challenging tasks that better represent real-world applications. As a result, this benchmark may have limited contributions to the fields of trustworthy and robust machine

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. This paper introduces a novel benchmark, LUMA, designed to measure uncertainty across multiple modalities: audio, image, and text. Additionally, the LUMA dataset allows for controlled manipulation of different types and levels of uncertainty. 2. The authors provide baseline pre-trained models along with three uncertainty quantification methods—Monte Carlo Dropout, Deep Ensemble, and Reliable Conflictive Multi-View Learning—offering a solid starting point for benchmarking.
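
For readers unfamiliar with Monte Carlo Dropout and Deep Ensembles, both produce several probability vectors for the same input (one per stochastic forward pass, or one per ensemble member), and uncertainty is read off from how those vectors spread. The sketch below shows one common way to summarize them, using the entropy of the averaged prediction and its split into an expected-entropy term and a disagreement term. It is an illustrative assumption, not the evaluation code or the metrics used in the paper; the function names and the toy inputs are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def summarize_predictions(prob_samples):
    """Summarize T probability vectors predicted for ONE input.

    Each vector comes from one stochastic forward pass (MC Dropout)
    or one ensemble member (Deep Ensemble). Hypothetical helper.
    """
    t = len(prob_samples)
    k = len(prob_samples[0])
    # Average the class probabilities over the T predictions.
    mean = [sum(p[c] for p in prob_samples) / t for c in range(k)]
    total = entropy(mean)                              # entropy of the mean
    expected = sum(entropy(p) for p in prob_samples) / t  # mean of entropies
    return {
        "mean": mean,
        "total": total,
        "aleatoric": expected,            # often read as data uncertainty
        "epistemic": total - expected,    # disagreement between predictions
    }


# All passes agree: the disagreement term is (numerically) zero.
agree = summarize_predictions([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])

# Passes are confident but contradict each other: disagreement is high.
disagree = summarize_predictions([[1.0, 0.0], [0.0, 1.0]])
```

In the second case each individual prediction has zero entropy, so all of the uncertainty in the averaged prediction is attributed to disagreement, which is the situation these methods are designed to expose.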

Weaknesses

1. The choice of data sources for specific modalities is unclear. For instance, it’s not evident why CIFAR-10/100 was used for images rather than ImageNet, which includes 1,000 labels and could provide greater diversity for benchmarking. I suggest that the authors clarify this decision in the main paper. 2. To convert text labels into audio, the authors propose a complex approach involving mapping labels to utterance transcriptions and segmenting using forced alignment. However, a simpler alter

Reviewer 04 · Rating 3 · Confidence 3

Strengths

Overall, the paper is clearly written. However, there are some limitations.

Weaknesses

1. This paper lacks practical applications in real-world scenarios. Both the data and the noise are simulated, which is very different from real-world applications. It collects images from CIFAR, audio using keyword spotting techniques, and text using an LLM to generate short descriptions. However, I cannot imagine that there is multimodal data containing these types of data in real-world scenarios. 2. All sample noise and label noise are simulated, but no real no

Code & Models

Repositories

bezirganyan/luma
PyTorch · Official

Datasets

bezirganyan/LUMA
dataset · 28k dl

Taxonomy

Methods · Dropout