Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang

TL;DR
This paper introduces a structured approach for organizing and retrieving fine-grained multimodal knowledge units for VQA, improving retrieval accuracy and reasoning in multimodal large language models.
Contribution
It proposes a novel knowledge structuring method and a retrieval-augmented framework (KU-RAG) that enhances VQA performance through systematic knowledge management and integration.
Findings
Outperforms existing KB-VQA methods across four benchmarks
Achieves an average performance improvement of 3%, and up to 11% in the best case
Enhances reasoning capabilities through a knowledge correction chain
Abstract
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments (e.g. text fragments, entity images, and so on) in a structured manner. Rather than…
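To make the idea of a fine-grained knowledge unit more concrete, here is a minimal Python sketch of how text fragments and entity images might be grouped into one structured record and retrieved by vector similarity. It is not the paper's implementation: the `KnowledgeUnit` class, its fields, the `retrieve` helper, and the two-dimensional toy embeddings are all hypothetical, and cosine-similarity ranking is an assumed stand-in for whatever retrieval KU-RAG actually uses.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List


@dataclass
class KnowledgeUnit:
    """Hypothetical record grouping multimodal fragments about one entity."""
    entity: str
    text_fragments: List[str]
    image_paths: List[str] = field(default_factory=list)
    # Placeholder vector; a real system would obtain this from an encoder.
    embedding: List[float] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: List[float], units: List[KnowledgeUnit], k: int = 1) -> List[KnowledgeUnit]:
    """Return the k units whose embeddings are most similar to the query."""
    ranked = sorted(units, key=lambda u: cosine(query, u.embedding), reverse=True)
    return ranked[:k]


# Toy data with hand-written embeddings (assumed, for illustration only).
units = [
    KnowledgeUnit("Eiffel Tower", ["Completed in 1889."], ["eiffel.jpg"], [1.0, 0.0]),
    KnowledgeUnit("Golden Gate Bridge", ["Opened in 1937."], ["bridge.jpg"], [0.0, 1.0]),
]

top = retrieve([0.9, 0.1], units, k=1)
print(top[0].entity)
```

Keeping the text fragments and image references together in one record means a single retrieval hit returns both modalities at once, which is the property the abstract contrasts with caption-only retrieval.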
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks
