Tackling VQA with Pretrained Foundation Models without Further Training

Alvin De Jun Tan; Bingquan Shen

arXiv:2309.15487 · cs.CV · September 28, 2023

TL;DR

This paper proposes a zero-shot approach to Visual Question Answering (VQA) that combines pretrained LLMs with other foundation models without additional training, representing each image in natural language so that the LLM can understand it.

Contribution

The paper introduces a method for performing VQA without further training: images are translated into natural language descriptions for an LLM, which avoids the computational cost of training to align image and text embeddings.

Findings

01. Effective zero-shot VQA performance on the VQAv2 dataset

02. Different decoding strategies impact accuracy

03. Avoids the need for large-scale image-text training datasets

Abstract

Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated the ability to adapt well to different tasks in zero-shot or few-shot settings. Given these capabilities, researchers have looked into how to adapt LLMs for Visual Question Answering (VQA). Many methods require further training to align the image and text embeddings. However, these methods are computationally expensive and require large-scale image-text datasets for training. In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem. The general idea is to use natural language to represent the images such that the LLM can understand the images. We explore different decoding strategies for generating textual representation of the image and evaluate…
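The general idea in the abstract (describe the image in text, then let an LLM answer from that text) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `build_vqa_prompt` and `answer_question`, the prompt wording, and the `captioner` and `llm` callables are all hypothetical stand-ins for whatever pretrained captioning model and LLM would be used. The captioner is assumed to return one or more captions, for example from different decoding strategies.

```python
from typing import Any, Callable, List


def build_vqa_prompt(captions: List[str], question: str) -> str:
    """Combine image captions and a question into one text prompt.

    The prompt wording is an assumption made for illustration only.
    """
    # Strip whitespace, drop empty captions, and remove duplicates
    # while preserving the original order.
    unique = list(dict.fromkeys(c.strip() for c in captions if c.strip()))
    context = "\n".join(f"- {c}" for c in unique)
    return (
        "Image description:\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Short answer:"
    )


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], List[str]],  # hypothetical: image -> captions
    llm: Callable[[str], str],              # hypothetical: prompt -> completion
) -> str:
    """Answer a question about an image using frozen models only."""
    captions = captioner(image)                       # image -> text
    prompt = build_vqa_prompt(captions, question)     # text + question
    return llm(prompt).strip()                        # LLM reads text only


if __name__ == "__main__":
    # Stub models so the sketch runs without any downloads.
    def stub_captioner(_image: Any) -> List[str]:
        return ["a dog playing with a red ball", "a dog on a beach"]

    def stub_llm(_prompt: str) -> str:
        return " red "

    print(answer_question(None, "What color is the ball?", stub_captioner, stub_llm))
```

Neither model is trained or fine-tuned in this sketch; the only component connecting them is the text prompt, which is why no image-text alignment dataset is needed.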

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: ALIGN