Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering

Paul Lerner; Olivier Ferret; Camille Guinaudeau

arXiv:2301.04366 · cs.CL · January 12, 2023

TL;DR

This paper introduces a pre-training method, Multimodal Inverse Cloze Task, for knowledge-based visual question answering about named entities. It uses the text and nearby images of multimodal documents to pre-train models for both passage retrieval and reading comprehension.

Contribution

The paper proposes a pre-training task that extends the Inverse Cloze Task from textual Question Answering with the images found near text in multimodal documents, so that multimodal fusion models can be trained without large amounts of annotated data.

Findings

01. Achieves a 9% relative MRR gain over the baseline for retrieval.

02. Achieves a 15% relative F1 gain over the baseline for reading comprehension.

03. Applies to different neural network architectures.
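
As a reading aid for the retrieval finding, the sketch below shows how Mean Reciprocal Rank (MRR) and a relative gain are usually computed. It assumes "relative" means the improvement divided by the baseline score, which this page does not define explicitly. The function names and all numbers are illustrative and are not taken from the paper.

```python
from typing import List, Optional


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR over a set of questions.

    Each entry is the 1-based rank of the first relevant passage retrieved
    for one question, or None when no relevant passage was retrieved.
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one question")
    total = sum(1.0 / rank if rank else 0.0 for rank in first_relevant_ranks)
    return total / len(first_relevant_ranks)


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline` (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Toy ranks for four questions (illustrative only).
toy_mrr = mean_reciprocal_rank([1, 2, None, 4])

# Toy scores chosen so that the relative gain is about 9% (illustrative only).
toy_gain = relative_gain(baseline=0.5, improved=0.545)
print(f"MRR={toy_mrr:.4f}, relative gain={toy_gain:.1%}")
```

Under this definition, a 9% relative gain means that the baseline score is multiplied by 1.09. It does not mean that the score rises by 9 absolute points.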

Abstract

We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading…
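
The pair construction described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PseudoPair` class, the `make_pseudo_pair` function, and the way images are picked (a random nearby image for each side) are hypothetical choices made for this example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PseudoPair:
    question_text: str             # one sentence used as the pseudo-question
    question_image: Optional[str]  # hypothetical: identifier of an image near the text
    passage_text: str              # remaining sentences, used as the pseudo-relevant passage
    passage_image: Optional[str]   # hypothetical: identifier of an image near the text


def make_pseudo_pair(
    sentences: List[str],
    nearby_images: List[str],
    rng: random.Random,
) -> PseudoPair:
    """Build one pseudo-question / pseudo-passage pair from a text span.

    Assumption: both sides receive a randomly chosen image from the same
    part of the document. The selection rule used in the paper may differ.
    """
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to split question from context")
    idx = rng.randrange(len(sentences))
    question = sentences[idx]
    # The context is every sentence except the one used as the pseudo-question.
    context = " ".join(sentences[:idx] + sentences[idx + 1:])
    question_image = rng.choice(nearby_images) if nearby_images else None
    passage_image = rng.choice(nearby_images) if nearby_images else None
    return PseudoPair(question, question_image, context, passage_image)


# Usage with made-up text and image identifiers.
example_sentences = [
    "The bridge opened in 1932.",
    "It spans the harbour.",
    "It carries rail and road traffic.",
]
example_pair = make_pseudo_pair(example_sentences, ["img_a.jpg", "img_b.jpg"], random.Random(0))
print(example_pair.question_text, "|", example_pair.passage_text)
```

A retriever pre-trained on such pairs would learn to score the pseudo-passage above passages built from other documents. That training step is outside the scope of this sketch.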

Code & Models

Repositories

paullerner/viquae (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling