Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training
Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C.H. Hoi

TL;DR
This paper introduces Plug-and-Play VQA, a zero-shot visual question answering framework that leverages large pretrained models without additional training, using natural language and network interpretation as intermediate representations.
Contribution
The authors propose a modular, training-free approach that combines pretrained language models with image captioning for zero-shot VQA, outperforming existing end-to-end trained models.
Findings
Achieves state-of-the-art zero-shot VQA results on VQAv2 and GQA datasets.
Outperforms larger models like Flamingo with fewer parameters.
Demonstrates effectiveness of natural language as an intermediate representation.
Abstract
Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on VQAv2. With 738M PLM parameters,…
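The two-stage flow described in the abstract (question-guided captions first, then a PLM that reads those captions as context) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `build_prompt`, `pnp_vqa_sketch`, `toy_captioner`, and `toy_reader` are hypothetical, the prompt layout is an assumption, and the toy callables stand in for real pretrained models. The network-interpretation step that makes captions question-guided is hidden behind the `captioner` callable here.

```python
from typing import Callable, List


def build_prompt(captions: List[str], question: str) -> str:
    """Join captions into one context string and append the question.

    The "Context / Question / Answer" layout is an assumed format,
    not one taken from the paper.
    """
    context = " ".join(c.strip() for c in captions if c.strip())
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


def pnp_vqa_sketch(
    image: str,
    question: str,
    captioner: Callable[[str, str, int], List[str]],
    reader: Callable[[str], str],
    num_captions: int = 5,
) -> str:
    """Caption the image with the question in mind, then let a text-only
    reader answer from the captions. Neither model is trained here."""
    captions = captioner(image, question, num_captions)
    prompt = build_prompt(captions, question)
    return reader(prompt)


def toy_captioner(image: str, question: str, n: int) -> List[str]:
    # Stand-in for a pretrained captioning model: the "image" is already
    # a text description, so we simply repeat it n times.
    return [image] * n


def toy_reader(prompt: str) -> str:
    # Stand-in for a pretrained language model doing question answering.
    return "red" if "red" in prompt.lower() else "unknown"


answer = pnp_vqa_sketch(
    "a red bus parked on a street",
    "What color is the bus?",
    toy_captioner,
    toy_reader,
    num_captions=2,
)
```

Because the only interface between the two stages is a plain string, either callable could be swapped for a different pretrained model without retraining the other, which is the modularity the abstract emphasizes.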
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
