Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

Ander Salaberria; Gorka Azkune; Oier Lopez de Lacalle; Aitor Soroa; Eneko Agirre

arXiv:2109.08029·cs.CV·September 14, 2022

TL;DR

This paper introduces a text-only approach to visual question answering that combines automatic image captioning with pretrained language models. On a knowledge-dependent task (OK-VQA), it outperforms pretrained multimodal models with a comparable number of parameters.

Contribution

The authors propose a novel text-only method for VQA that outperforms comparable multimodal models on external knowledge tasks, highlighting the effectiveness of language models in this domain.

Findings

01. Text-only models outperform multimodal models of comparable size on OK-VQA.

02. Increasing the language model's size notably improves performance.

03. Automatic captions often miss relevant image information.

Abstract

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task that requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective on a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that increasing the language model's size notably improves its performance, yielding results comparable to the…
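The text-only procedure described in the abstract can be illustrated with a minimal sketch: an image is first turned into a caption, the caption is joined with the question, and a text model picks an answer from a fixed candidate list. Everything named here is an assumption made for illustration and not the authors' code: the `TextOnlyVQA` class, the `[SEP]` joining format, the canned `toy_captioner`, and the `TOY_KNOWLEDGE` lookup that stands in for the world knowledge a pretrained language model would supply.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TextOnlyVQA:
    """Hypothetical caption-then-text pipeline (illustrative only)."""
    captioner: Callable[[str], str]       # image path -> caption text
    scorer: Callable[[str, str], float]   # (text input, candidate answer) -> score
    answers: Sequence[str]                # fixed candidate answer vocabulary

    def build_input(self, image_path: str, question: str) -> str:
        # The image is only ever seen through its caption.
        caption = self.captioner(image_path)
        # Assumed joining format; a real model would use its tokenizer's pair encoding.
        return f"{question.strip()} [SEP] {caption.strip()}"

    def answer(self, image_path: str, question: str) -> str:
        text = self.build_input(image_path, question)
        # Pick the highest-scoring candidate answer.
        return max(self.answers, key=lambda a: self.scorer(text, a))


def toy_captioner(image_path: str) -> str:
    # Stand-in for an off-the-shelf captioning model: returns a canned caption.
    return "a man riding a wave on a surfboard"


# Stand-in for world knowledge stored in a pretrained language model.
TOY_KNOWLEDGE = {"surfboard": "ocean", "wave": "ocean", "skis": "snow"}


def toy_scorer(text: str, answer: str) -> float:
    # Count words in the input that the toy knowledge table links to the answer.
    words = text.lower().replace("?", " ").split()
    return float(sum(1 for w in words if TOY_KNOWLEDGE.get(w) == answer))


model = TextOnlyVQA(toy_captioner, toy_scorer, ["snow", "street", "ocean"])
print(model.answer("beach.jpg", "Where does this sport take place?"))  # prints "ocean"
```

In this sketch the answer depends entirely on what the caption mentions, which mirrors the finding that automatic captions can miss image information relevant to the question.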

Code & Models

Repositories

salanueva/CBM
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization