Read, Look or Listen? What's Needed for Solving a Multimodal Dataset
Netta Madvil, Yonatan Bitton, Roy Schwartz

TL;DR
This paper introduces a two-step method for analyzing multimodal datasets with minimal human annotation, revealing modality importance, biases, and limitations in the TVQA dataset and the MERLOT Reserve model, and proposing a new challenging test set.
Contribution
The paper presents a novel two-step approach to analyze multimodal datasets, highlighting modality reliance and proposing a new test set to evaluate multimodal integration.
Findings
Most questions in TVQA can be answered using a single modality.
Over 70% of questions are solvable with multiple single-modality strategies.
Existing datasets such as TVQA show limited multimodal integration, and models evaluated on them struggle with certain question types.
Abstract
The prevalence of large-scale multimodal datasets presents unique challenges in assessing dataset quality. We propose a two-step method to analyze multimodal datasets, which leverages a small seed of human annotation to map each multimodal instance to the modalities required to process it. Our method sheds light on the importance of different modalities in datasets, as well as the relationship between them. We apply our approach to TVQA, a video question-answering dataset, and discover that most questions can be answered using a single modality, without a substantial bias towards any specific modality. Moreover, we find that more than 70% of the questions are solvable using several different single-modality strategies, e.g., by either looking at the video or listening to the audio, highlighting the limited integration of multiple modalities in TVQA. We leverage our annotation and…
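To make the statistics in the abstract concrete, here is a minimal sketch of how per-question modality annotations could be aggregated. The annotation format, the toy questions, and the `summarize` helper are all assumptions made for illustration and are not taken from the paper. The sketch covers only the counting, not how the authors extend the seed annotation to the full dataset.

```python
from collections import Counter

# Hypothetical annotation format (not the paper's): each question maps to the
# list of strategies that suffice to answer it, and each strategy is the set
# of modalities it needs. The toy data below is invented for illustration.
annotations = {
    "q1": [{"video"}, {"audio"}],            # either looking or listening works
    "q2": [{"text"}],                        # reading the subtitles suffices
    "q3": [{"video", "text"}],               # needs two modalities together
    "q4": [{"text"}, {"audio"}, {"video"}],  # any single modality works
}


def summarize(annotations):
    """Aggregate modality requirements over all annotated questions."""
    n = len(annotations)
    single = 0          # questions with at least one single-modality strategy
    multi_single = 0    # questions with two or more such strategies
    per_modality = Counter()
    for strategies in annotations.values():
        # Modalities that are sufficient on their own for this question.
        singles = {next(iter(s)) for s in strategies if len(s) == 1}
        if singles:
            single += 1
        if len(singles) >= 2:
            multi_single += 1
        per_modality.update(singles)
    return {
        "single_modality_share": single / n,
        "multiple_single_strategies_share": multi_single / n,
        "per_modality_share": {m: c / n for m, c in per_modality.items()},
    }


summary = summarize(annotations)
print(summary)
```

On this toy data, three of the four questions are answerable from a single modality and two of them by more than one single-modality strategy. A roughly even `per_modality_share` would correspond to the abstract's observation that there is no substantial bias towards any specific modality.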
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech Recognition and Synthesis · Speech and Audio Processing
