MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji

TL;DR
This paper introduces MuMuQA, a new multimedia question answering benchmark involving cross-media reasoning over news articles with images and text, along with a data augmentation framework to improve model training.
Contribution
The paper contributes a benchmark for cross-media QA over news articles and a multimedia data augmentation framework that supplies weak supervision for training, advancing research in multimedia reasoning.
Findings
Models perform well but still lag behind humans.
The benchmark reveals challenges in cross-media grounding.
Data augmentation improves model training effectiveness.
Abstract
Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on…
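The two-hop structure described in the abstract (ground the visual object through the image-caption pair, then predict a span from the body text) can be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the function names, the regex-based stand-ins, and the toy article are all hypothetical and are not the benchmark's schema or the authors' model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class NewsExample:
    """Hypothetical record; field names are illustrative, not the benchmark's schema."""
    caption: str
    body_text: str
    question: str


def answer_two_hop(
    example: NewsExample,
    ground: Callable[[str, str], str],
    extract: Callable[[str, str, str], Tuple[int, int]],
) -> str:
    # Hop 1: resolve the visual reference in the question to an entity
    # named in the caption (stand-in for cross-media grounding).
    entity = ground(example.question, example.caption)
    # Hop 2: predict a character span in the body text about that entity.
    start, end = extract(example.question, entity, example.body_text)
    return example.body_text[start:end]


def toy_ground(question: str, caption: str) -> str:
    # Toy assumption: the caption begins with the depicted person's name.
    match = re.match(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", caption)
    if match is None:
        raise ValueError("no leading name found in caption")
    return match.group(0)


def toy_extract(question: str, entity: str, body_text: str) -> Tuple[int, int]:
    # Toy assumption: the answer follows the pattern "<entity> was born in <Place>".
    match = re.search(re.escape(entity) + r" was born in ([A-Z][a-z]+)", body_text)
    if match is None:
        raise ValueError("no answer span found in body text")
    return match.span(1)


# Made-up toy article, not taken from the dataset.
example = NewsExample(
    caption="Jane Doe (left) speaks at the summit.",
    body_text="Jane Doe was born in Springfield. She later moved abroad.",
    question="Where was the person on the left in the image born?",
)

print(answer_two_hop(example, toy_ground, toy_extract))
```

In a real system both callables would be learned components. The sketch only shows why the answer must be a span of the body text while the entity it concerns is identified through the image and caption.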
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
