Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge
Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang

TL;DR
This paper introduces RASO, a VQA approach that uses a language model to generate multiple candidate answers and then selects the correct one, expanding knowledge coverage and improving on the state of the art on OK-VQA.
Contribution
RASO is the first to use a generate-then-select strategy guided by world knowledge for open-ended VQA, enhancing knowledge coverage and accuracy.
Findings
RASO improves on the previous state of the art by 4.1% on OK-VQA.
The generate-then-select pipeline expands knowledge coverage.
RASO does not increase computational cost.
Abstract
The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certain tokens over other tokens regardless of prompt changes, and high dependency on the PLM quality -- only models using GPT-3 can achieve the best result. To address the aforementioned challenges, we propose RASO: a new VQA pipeline that deploys a generate-then-select strategy guided by world knowledge for the first time. Rather than following the de facto standard to train a multi-modal model that directly generates the VQA answer, RASO first adopts PLM to generate all the possible answers,…
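The generate-then-select idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `generate_candidates`, `select_answer`, `toy_generator`, and `toy_scorer` are hypothetical names, the generator and scorer are canned stand-ins for a PLM and a trained selector, and the question, prompts, and scores are invented for the example.

```python
from typing import Callable, List, Sequence


def generate_candidates(
    sample_answer: Callable[[str], str], prompts: Sequence[str]
) -> List[str]:
    """Call the (hypothetical) generator once per prompt and keep the
    distinct, normalized answers in first-seen order."""
    candidates: List[str] = []
    for prompt in prompts:
        answer = sample_answer(prompt).strip().lower()
        if answer and answer not in candidates:
            candidates.append(answer)
    return candidates


def select_answer(
    score: Callable[[str, str], float], question: str, candidates: Sequence[str]
) -> str:
    """Return the candidate with the highest (hypothetical) selector score."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=lambda c: score(question, c))


# Toy stand-ins: a real system would query a PLM and a trained selector.
def toy_generator(prompt: str) -> str:
    canned = {"prompt-a": "Surfboard", "prompt-b": "kite ", "prompt-c": "surfboard"}
    return canned.get(prompt, "")


def toy_scorer(question: str, candidate: str) -> float:
    scores = {"surfboard": 0.2, "kite": 0.7}
    return scores.get(candidate, 0.0)


question = "What is the person flying?"
candidates = generate_candidates(toy_generator, ["prompt-a", "prompt-b", "prompt-c"])
answer = select_answer(toy_scorer, question, candidates)
print(candidates, answer)
```

Separating the two stages means the selector only has to rank a short candidate list, while varying the prompts is what widens the pool of answers it can choose from.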
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Weight Decay · Cosine Annealing · Attention Dropout · Layer Normalization · Byte Pair Encoding
