Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering
Paul Lerner, Olivier Ferret, Camille Guinaudeau

TL;DR
This paper introduces a novel pre-training method called Multimodal Inverse Cloze Task for improving knowledge-based visual question answering about named entities, leveraging multimodal data to enhance retrieval and comprehension.
Contribution
It proposes a new pre-training task that combines textual and visual information, improving performance in knowledge-based visual question answering tasks.
Findings
Achieves a 9% relative-MRR improvement over baseline.
Attains a 15% relative-F1 gain in reading comprehension.
Applicable across different neural network architectures.
Abstract
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling
