TL;DR
mKG-RAG introduces a novel framework that integrates multimodal knowledge graphs into retrieval-augmented generation for improved knowledge-intensive visual question answering, enhancing accuracy and reliability.
Contribution
It proposes a new method that combines multimodal KGs with RAG, using graph extraction and dual-stage retrieval to improve VQA performance.
Findings
Outperforms existing methods on knowledge-based VQA tasks.
Achieves new state-of-the-art results.
Effectively leverages structured multimodal knowledge.
Abstract
Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for…
