Towards Solving Multimodal Comprehension

Pritish Sahu, Karan Sikka, and Ajay Divakaran

arXiv:2104.10139 · cs.CL · April 21, 2021 · 1 citation

TL;DR

This paper introduces two new multimodal machine comprehension datasets, analyzes dataset biases affecting model performance, and proposes a method to mitigate bias, thereby advancing research in procedural multimodal understanding.

Contribution

The paper presents two new datasets, WoodworkQA and DecorationQA, analyzes inherent biases in existing datasets, and proposes an algorithm to remove the bias so that evaluation better reflects actual comprehension.

Findings

01. Naive baselines perform similarly to advanced models due to dataset bias.

02. Bias correction reduces model performance by 8-16%.

03. New datasets provide valuable benchmarks for multimodal comprehension.

Abstract

This paper targets the problem of procedural multimodal machine comprehension (M3C). This task requires an AI to comprehend given steps of multimodal instructions and then answer questions. Compared to vanilla machine comprehension tasks, where an AI is required only to understand a textual input, procedural M3C is more challenging because the AI needs to comprehend both the temporal and causal factors along with multimodal inputs. Recently, Yagcioglu et al. [35] introduced the RecipeQA dataset to evaluate M3C. Our first contribution is the introduction of two new M3C datasets, WoodworkQA and DecorationQA, with 16K and 10K instructional procedures, respectively. We then evaluate M3C using a textual cloze-style question-answering task and highlight an inherent bias in the question-answer generation method from [35] that enables a naive baseline to cheat by learning from only answer choices. This…
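To make the "learning from only answer choices" idea concrete, here is a minimal sketch of a choice-only baseline for a multiple-choice cloze task. It is not the authors' baseline or their bias-removal algorithm. The function names, the smoothed counting rule, and the toy items are all assumptions made up for illustration. The point it shows is that a predictor which never reads the instructions or the question can still beat chance whenever some candidate strings are correct more often than they are distractors.

```python
from collections import Counter

# Hypothetical toy items: (answer choices, index of the correct choice).
# The context and question are deliberately absent, since the baseline never uses them.
TRAIN = [
    (["Sand the edges", "Cut the fabric", "Hang the wreath", "Mix the paint"], 0),
    (["Glue the joints", "Sand the edges", "Hang the wreath", "Cut the fabric"], 1),
    (["Mix the paint", "Hang the wreath", "Glue the joints", "Tie the ribbon"], 2),
]
HELD_OUT = [
    (["Hang the wreath", "Tie the ribbon", "Sand the edges", "Mix the paint"], 2),
]


def fit_choice_only(items):
    """Count how often each candidate appears at all and how often it is correct."""
    correct, seen = Counter(), Counter()
    for choices, answer_idx in items:
        seen.update(choices)
        correct[choices[answer_idx]] += 1
    return correct, seen


def predict_choice_only(choices, correct, seen):
    """Pick the candidate with the highest smoothed rate of having been correct."""
    # Add-one smoothing keeps unseen candidates at a neutral score of 0.5.
    scores = [(correct[c] + 1) / (seen[c] + 2) for c in choices]
    return scores.index(max(scores))


def accuracy(items, correct, seen):
    """Fraction of items where the choice-only prediction matches the label."""
    hits = sum(predict_choice_only(ch, correct, seen) == ans for ch, ans in items)
    return hits / len(items)


correct_counts, seen_counts = fit_choice_only(TRAIN)
print("choice-only accuracy:", accuracy(HELD_OUT, correct_counts, seen_counts))
print("chance level:", 1 / len(HELD_OUT[0][0]))
```

On an unbiased benchmark a baseline like this should land near chance level, so a large gap between the two printed numbers on real data would be a sign that the answer choices alone leak the label.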

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques