Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models
Jean Park, Kuk Jin Jang, Basam Alasaly, Sriharsha Mopidevi, Andrew Zolensky, Eric Eaton, Insup Lee, Kevin Johnson

TL;DR
This paper introduces the modality importance score (MIS) to detect modality bias in video question-answering datasets. It finds that questions in current datasets can often be answered from a single modality, which limits how well these datasets exercise multimodal reasoning in large language models.
Contribution
The paper proposes a novel MIS metric using state-of-the-art MLLMs to identify modality bias and guide the creation of more balanced multimodal datasets.
Findings
Current datasets exhibit significant unimodal bias.
MLLMs perform poorly on permuted feature sets, indicating limited multimodal integration.
MIS can effectively guide dataset curation for better multimodal learning.
Abstract
Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further…
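The idea of scoring how much each modality contributes to answering a question can be illustrated with a small sketch. This is not the paper's MIS definition, which the abstract does not spell out. It assumes a hypothetical setup in which a model's correctness on one question has been recorded for every subset of input modalities, and it contrasts accuracy with and without each modality. The modality names, the `importance_contrast` function, and the example data are all illustrative assumptions.

```python
from itertools import combinations

# Hypothetical modality names; the paper's datasets may use others.
MODALITIES = ("video", "subtitles")


def all_subsets(modalities):
    """Return every subset of the modalities, including the empty set."""
    return [
        frozenset(combo)
        for size in range(len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


def importance_contrast(correct_by_subset, modalities=MODALITIES):
    """Illustrative per-modality score for a single question.

    correct_by_subset maps each frozenset of modalities to True/False,
    i.e. whether the model answered correctly given only those inputs.
    The score is mean correctness over subsets that include the modality
    minus mean correctness over subsets that exclude it. This is an
    assumed contrast, not the MIS formula from the paper.
    """
    scores = {}
    for m in modalities:
        with_m = [float(ok) for s, ok in correct_by_subset.items() if m in s]
        without_m = [float(ok) for s, ok in correct_by_subset.items() if m not in s]
        scores[m] = sum(with_m) / len(with_m) - sum(without_m) / len(without_m)
    return scores


# Made-up example: a question that can be answered from subtitles alone.
example = {s: ("subtitles" in s) for s in all_subsets(MODALITIES)}
scores = importance_contrast(example)
```

Under this assumed scoring, a question where only one modality has a clearly positive score would be a candidate for unimodal bias, while a genuinely multimodal question would be answered correctly only when several modalities are present together.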
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
