Towards Solving Multimodal Comprehension
Pritish Sahu, Karan Sikka, and Ajay Divakaran

TL;DR
This paper introduces two new multimodal machine comprehension datasets, analyzes dataset biases affecting model performance, and proposes a method to mitigate bias, thereby advancing research in procedural multimodal understanding.
Contribution
The paper presents new datasets WoodworkQA and DecorationQA, analyzes inherent biases in existing datasets, and proposes an algorithm to remove bias, improving evaluation accuracy.
Findings
Naive baselines perform similarly to advanced models due to dataset bias.
Bias correction reduces model performance by 8-16%.
New datasets provide valuable benchmarks for multimodal comprehension.
Abstract
This paper targets the problem of procedural multimodal machine comprehension (M3C). This task requires an AI to comprehend given steps of multimodal instructions and then answer questions. Compared to vanilla machine comprehension tasks where an AI is required only to understand a textual input, procedural M3C is more challenging as the AI needs to comprehend both the temporal and causal factors along with multimodal inputs. Recently, Yagcioglu et al. [35] introduced the RecipeQA dataset to evaluate M3C. Our first contribution is the introduction of two new M3C datasets, WoodworkQA and DecorationQA, with 16K and 10K instructional procedures, respectively. We then evaluate M3C using a textual cloze-style question-answering task and highlight an inherent bias in the question-answer generation method from [35] that enables a naive baseline to cheat by learning from only answer choices. This…
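To make the "learning from only answer choices" idea concrete, here is a minimal sketch of a choices-only baseline on toy data. It is not the authors' baseline and does not reproduce the specific bias they report. The function names, the example strings, and the assumed bias (distractors being reused across questions while correct answers are not) are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def train_choice_prior(examples):
    """Count how often each choice string is correct vs. merely offered."""
    correct, offered = Counter(), Counter()
    for choices, answer_idx in examples:
        for i, choice in enumerate(choices):
            offered[choice] += 1
            if i == answer_idx:
                correct[choice] += 1
    return correct, offered

def predict(choices, correct, offered):
    """Pick the choice with the highest smoothed P(correct | choice text).

    The question and its multimodal context are never consulted.
    """
    scores = [(correct[c] + 1) / (offered[c] + 2) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def accuracy(examples, correct, offered):
    hits = sum(predict(ch, correct, offered) == ans for ch, ans in examples)
    return hits / len(examples)

# Hypothetical toy data: each item is (answer choices, index of correct choice).
# Assumed bias: the same distractors recur across questions.
train = [
    (["sand the board", "whisk eggs", "boil pasta", "chop onions"], 0),
    (["whisk eggs", "apply varnish", "chop onions", "boil pasta"], 1),
    (["boil pasta", "chop onions", "cut the plank", "whisk eggs"], 2),
]
test_set = [
    (["chop onions", "boil pasta", "whisk eggs", "sand the board"], 3),
    (["apply varnish", "whisk eggs", "boil pasta", "chop onions"], 0),
]

correct, offered = train_choice_prior(train)
print(accuracy(test_set, correct, offered))  # 1.0 here, versus 0.25 by chance
```

On this toy data the baseline answers every question correctly without reading any context, which is the kind of shortcut that signals bias in how answer choices were generated.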
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
