Tackling VQA with Pretrained Foundation Models without Further Training
Alvin De Jun Tan, Bingquan Shen

TL;DR
This paper proposes a zero-shot approach to Visual Question Answering that combines pretrained LLMs with other foundation models without additional training, representing each image in natural language so that the LLM can reason about it.
Contribution
It introduces a method for performing VQA without further training: images are translated into natural language descriptions for the LLM, which avoids the computational cost of training to align image and text embeddings.
Findings
Effective zero-shot VQA performance on the VQAv2 dataset
The decoding strategy used to generate image descriptions affects accuracy
Avoids the need for large-scale image-text training datasets
Abstract
Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated the ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability of these LLMs, researchers have looked into how to adopt them for use with Visual Question Answering (VQA). Many methods require further training to align the image and text embeddings. However, these methods are computationally expensive and require large-scale image-text datasets for training. In this paper, we explore a method of combining pretrained LLMs and other foundation models without further training to solve the VQA problem. The general idea is to use natural language to represent the images such that the LLM can understand the images. We explore different decoding strategies for generating textual representations of the image and evaluate…
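The general idea in the abstract (describe the image in text, then let an LLM answer from that text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_vqa_prompt` and `answer_question`, the prompt wording, and the `captioner` and `llm` callables are all hypothetical placeholders. In practice `captioner` would wrap a pretrained captioning model (possibly returning several captions produced by different decoding strategies) and `llm` would wrap a pretrained language model.

```python
from typing import Any, Callable, List


def build_vqa_prompt(captions: List[str], question: str) -> str:
    """Turn image captions and a question into a plain-text LLM prompt.

    The prompt template is an assumption for illustration only.
    """
    seen = set()
    unique = []
    for caption in captions:
        caption = caption.strip()
        # Skip empty captions and case-insensitive duplicates, keeping order.
        if caption and caption.lower() not in seen:
            seen.add(caption.lower())
            unique.append(caption)
    context = "\n".join(f"- {c}" for c in unique)
    return (
        "Image description:\n"
        f"{context}\n"
        f"Question: {question.strip()}\n"
        "Short answer:"
    )


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any], List[str]],  # hypothetical: image -> captions
    llm: Callable[[str], str],              # hypothetical: prompt -> completion
) -> str:
    """Zero-shot VQA sketch: caption the image, then query a frozen LLM."""
    captions = captioner(image)
    prompt = build_vqa_prompt(captions, question)
    return llm(prompt).strip()
```

Neither model is trained or fine-tuned here; both are passed in as frozen callables, which mirrors the "without further training" setting. Accepting a list of captions leaves room to compare decoding strategies by swapping the captioner.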
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN
