Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

Zhengxuan Zhang; Yin Wu; Yuyu Luo; Nan Tang

arXiv:2502.20964 · cs.CV · July 9, 2025

Open Access

TL;DR

This paper introduces a structured approach to organizing and retrieving fine-grained multimodal knowledge units for VQA, improving retrieval accuracy and reasoning in multimodal large language models.

Contribution

The paper proposes a knowledge structuring method and a retrieval-augmented framework (KU-RAG) that improves VQA performance through systematic knowledge management and integration.

Findings

01. Outperforms existing KB-VQA methods across four benchmarks.

02. Achieves an average performance improvement of 3%, and up to 11% in the best case.

03. Enhances reasoning capabilities through a knowledge correction chain.

Abstract

Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or up-to-date knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments (e.g., text fragments and entity images) organized in a structured manner. Rather than…
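
The abstract describes knowledge units only at a high level, so the following is a minimal illustrative sketch and not the paper's KU-RAG implementation. It assumes a knowledge unit can be modeled as a small record holding a text fragment, an optional entity-image reference, and a precomputed embedding, and that retrieval ranks units by cosine similarity to a query embedding. The names `KnowledgeUnit`, `cosine_similarity`, and `retrieve`, the field names, and the toy 3-dimensional vectors are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnowledgeUnit:
    """Hypothetical record for one fine-grained multimodal knowledge unit."""
    unit_id: str
    text_fragment: str
    image_ref: Optional[str]   # assumed: path or ID of an entity image, if any
    embedding: List[float]     # assumed: produced offline by some multimodal encoder


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_embedding: List[float],
             units: List[KnowledgeUnit],
             k: int = 2) -> List[KnowledgeUnit]:
    """Return the k units whose embeddings are most similar to the query."""
    ranked = sorted(
        units,
        key=lambda u: cosine_similarity(query_embedding, u.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the embeddings are made up purely for illustration.
units = [
    KnowledgeUnit("eiffel", "The Eiffel Tower was completed in 1889.",
                  "img/eiffel.jpg", [1.0, 0.0, 0.0]),
    KnowledgeUnit("louvre", "The Louvre is a museum in Paris.",
                  "img/louvre.jpg", [0.0, 1.0, 0.0]),
    KnowledgeUnit("bridge", "The Golden Gate Bridge opened in 1937.",
                  None, [0.7, 0.7, 0.0]),
]

# Stand-in for the embedding of an (image, question) pair.
query = [0.9, 0.1, 0.0]
top_units = retrieve(query, units, k=2)
top_ids = [u.unit_id for u in top_units]
```

In a real system the retrieved units' text fragments and images would be passed to the MLLM as additional context; how the paper does this is not stated in the truncated abstract, so that step is left out here.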

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Graph Neural Networks