Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering
Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang

TL;DR
This paper introduces a decision-based agent approach for knowledge-based visual question answering, enabling multi-step reasoning and retrieval actions to improve answer accuracy.
Contribution
It reformulates KB-VQA as a multi-step decision process, allowing dynamic action selection and better alignment of retrieved evidence with questions.
Findings
Achieves state-of-the-art results on InfoSeek and E-VQA datasets.
Outperforms prior methods in accuracy and reasoning ability.
Demonstrates effective multi-step decision-making in KB-VQA.
Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions: Answer, Image Retrieval, Text Retrieval, and Caption-based…
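
The multi-step decision procedure described in the abstract can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: `run_agent`, `Policy`, `Tool`, the `max_steps` budget, and the toy policy and tools are all hypothetical names introduced here for illustration, and the fourth action is labeled only `CAPTION_BASED` because its full name is cut off in the text above. The policy stands in for the vision-language model that decides when to search, which tool to call, and when to stop.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class Action(Enum):
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION_BASED = "caption_based"  # full action name is truncated in the source


# Hypothetical interfaces (assumptions, not from the paper):
# a policy maps (question, evidence so far) to the next action and its argument;
# a tool maps that argument (e.g. a query string) to one piece of evidence.
Policy = Callable[[str, List[str]], Tuple[Action, str]]
Tool = Callable[[str], str]


def run_agent(
    question: str,
    policy: Policy,
    tools: Dict[Action, Tool],
    max_steps: int = 4,  # assumed step budget, not specified in the source
) -> Tuple[Optional[str], List[str]]:
    """Run the decision loop until the policy answers or the budget runs out."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action, argument = policy(question, evidence)
        if action is Action.ANSWER:
            # The policy chose to stop; its argument is the final answer.
            return argument, evidence
        # Any other action calls the matching tool and stores the result.
        evidence.append(tools[action](argument))
    # Budget exhausted without an Answer action.
    return None, evidence


def toy_policy(question: str, evidence: List[str]) -> Tuple[Action, str]:
    """Toy stand-in for the model: search once, then answer with what was found."""
    if not evidence:
        return Action.TEXT_RETRIEVAL, question
    return Action.ANSWER, evidence[-1]


# Stub tools that echo the query instead of hitting a real knowledge base.
toy_tools: Dict[Action, Tool] = {
    Action.IMAGE_RETRIEVAL: lambda q: "stub image evidence for: " + q,
    Action.TEXT_RETRIEVAL: lambda q: "stub text evidence for: " + q,
    Action.CAPTION_BASED: lambda q: "stub caption evidence for: " + q,
}

answer, evidence = run_agent("Who designed this bridge?", toy_policy, toy_tools)
```

The point of the sketch is the contrast with a fixed retrieve-filter-answer pipeline: here the number and order of retrieval calls is decided step by step by the policy rather than fixed in advance.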
