Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

Quanxing Xu; Ling Zhou; Xian Zhong; Xiaohua Huang; Rubing Huang; Chia-Wen Lin

arXiv:2605.03790·cs.CV·May 6, 2026

TL;DR

This paper introduces CgRAG, a retrieval-augmented generation framework that improves visual question answering with multimodal large language models by combining structured reasoning with external knowledge.

Contribution

It proposes CoVQD, a logical prompting strategy that combines Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD) to guide knowledge retrieval in VQA tasks.

Findings

1. Enhanced accuracy on the E-VQA, InfoSeek, and OKVQA benchmarks.

2. Improved reasoning coherence and knowledge relevance in VQA.

3. Better generalization in complex cross-domain scenarios.

Abstract

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent…
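The abstract describes decomposing a visual question into a chain of sub-questions and using them to guide retrieval. The following is a minimal sketch of that general idea, not the authors' implementation: the names `Passage`, `retrieve`, `chained_retrieval`, and `build_prompt` are hypothetical, the sub-questions are written by hand (in a real system an MLLM would presumably generate them from the image and question), and a toy word-overlap ranker stands in for an actual retriever.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    title: str
    text: str


def tokenize(text: str) -> set[str]:
    # Toy tokenizer: lowercase alphanumeric words (a real system would use a learned encoder).
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Passage], k: int = 1) -> list[Passage]:
    # Toy retriever (assumption): rank passages by the number of words shared with the query.
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda p: len(query_words & tokenize(p.title + " " + p.text)),
        reverse=True,
    )
    return ranked[:k]


def chained_retrieval(
    sub_questions: list[str], corpus: list[Passage], k: int = 1
) -> list[tuple[str, Passage]]:
    # Retrieve once per sub-question, in order, skipping passages already collected.
    evidence: list[tuple[str, Passage]] = []
    seen_titles: set[str] = set()
    for sub_question in sub_questions:
        for passage in retrieve(sub_question, corpus, k):
            if passage.title not in seen_titles:
                seen_titles.add(passage.title)
                evidence.append((sub_question, passage))
    return evidence


def build_prompt(question: str, evidence: list[tuple[str, Passage]]) -> str:
    # Lay out the sub-questions and their retrieved knowledge as ordered reasoning steps.
    lines = ["Answer the question using the retrieved knowledge."]
    for step, (sub_question, passage) in enumerate(evidence, start=1):
        lines.append(f"Step {step}: {sub_question}")
        lines.append(f"Knowledge: {passage.text}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


corpus = [
    Passage("Eiffel Tower", "The Eiffel Tower is a wrought-iron lattice tower in Paris, completed in 1889."),
    Passage("Golden Gate Bridge", "The Golden Gate Bridge is a suspension bridge in San Francisco, opened in 1937."),
    Passage("Gustave Eiffel", "Gustave Eiffel was a French civil engineer whose company designed and built the Eiffel Tower."),
]

question = "Who built the landmark shown in the image?"
# Hand-written decomposition (assumption): first identify the landmark, then ask about it.
sub_questions = [
    "Which landmark is the lattice tower in Paris?",
    "Which engineer designed and built the Eiffel Tower?",
]

evidence = chained_retrieval(sub_questions, corpus)
prompt = build_prompt(question, evidence)
print(prompt)
```

Retrieving per sub-question, rather than once for the original question, is what lets the second hop reach knowledge about the engineer: the original question never names the landmark, so a single query would have little to match against.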
