Assessing Modality Bias in Video Question Answering Benchmarks with Multimodal Large Language Models

Jean Park; Kuk Jin Jang; Basam Alasaly; Sriharsha Mopidevi; Andrew Zolensky; Eric Eaton; Insup Lee; Kevin Johnson

arXiv:2408.12763 · cs.LG · December 23, 2024

TL;DR

This paper introduces the modality importance score (MIS) to detect modality bias in video question-answering datasets. It finds that questions in current datasets can often be answered from a single modality, which limits how well these datasets exercise multimodal reasoning in multimodal large language models.

Contribution

The paper proposes the MIS metric, together with a method that uses state-of-the-art MLLMs to estimate it, in order to identify modality bias and guide the creation of more balanced multimodal datasets.

Findings

01. Current datasets exhibit significant unimodal bias.

02. MLLMs perform poorly on permuted feature sets, indicating limited multimodal integration.

03. MIS can effectively guide dataset curation for better multimodal learning.

Abstract

Multimodal large language models (MLLMs) can simultaneously process visual, textual, and auditory data, capturing insights that complement human analysis. However, existing video question-answering (VidQA) benchmarks and datasets often exhibit a bias toward a single modality, despite the goal of requiring advanced reasoning skills that integrate diverse modalities to answer the queries. In this work, we introduce the modality importance score (MIS) to identify such bias. It is designed to assess which modality embeds the necessary information to answer the question. Additionally, we propose an innovative method using state-of-the-art MLLMs to estimate the modality importance, which can serve as a proxy for human judgments of modality perception. With this MIS, we demonstrate the presence of unimodal bias and the scarcity of genuinely multimodal questions in existing datasets. We further…
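The abstract describes MIS only at a high level and does not give its formula. As a rough illustration of the idea of asking which modality carries the information needed to answer a question, the sketch below assumes one plausible formulation: the best result obtained with a modality present minus the best result obtained without it. This formulation, the names `modality_importance` and `classify_question`, the category labels, and the example scores are all hypothetical and should not be read as the authors' definition.

```python
from typing import Dict, FrozenSet, List

# Hypothetical per-question results: each key is the set of modalities shown
# to the model, each value is 1 if the model answered correctly, else 0.
Scores = Dict[FrozenSet[str], int]


def modality_importance(scores: Scores, modality: str) -> int:
    """Assumed score: best result with the modality minus best result without it."""
    with_m = [v for k, v in scores.items() if modality in k]
    without_m = [v for k, v in scores.items() if modality not in k]
    if not with_m or not without_m:
        raise ValueError("need results both with and without the modality")
    return max(with_m) - max(without_m)


def classify_question(scores: Scores, modalities: List[str]) -> str:
    """Label a question by how many modalities have a positive assumed score."""
    important = [m for m in modalities if modality_importance(scores, m) > 0]
    if len(important) == 1:
        return f"{important[0]}-biased"
    if len(important) > 1:
        return "multimodal"
    return "modality-agnostic"


V, S = "video", "subtitle"

# Made-up example: answerable from video alone.
video_only = {frozenset(): 0, frozenset({V}): 1, frozenset({S}): 0, frozenset({V, S}): 1}

# Made-up example: answerable only when both modalities are combined.
complementary = {frozenset(): 0, frozenset({V}): 0, frozenset({S}): 0, frozenset({V, S}): 1}

print(classify_question(video_only, [V, S]))
print(classify_question(complementary, [V, S]))
```

Under this assumed formulation, a question counts as genuinely multimodal only when removing either modality lowers the best achievable result, which matches the abstract's distinction between unimodal bias and questions that require integrating modalities.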

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning