TL;DR
M$^3$KG-RAG enhances multimodal reasoning in large language models by retrieving and grounding multi-hop multimodal knowledge from knowledge graphs, improving answer faithfulness and relevance.
Contribution
It introduces a multi-hop multimodal knowledge graph and a grounding and pruning method to improve multimodal retrieval-augmented generation.
Findings
Significant improvement in multimodal reasoning accuracy.
Enhanced grounding and relevance in generated responses.
Effective multi-hop knowledge retrieval from MMKGs.
Abstract
Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains…
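The second limitation named in the abstract (similarity-only retrieval letting off-topic or redundant knowledge through) can be illustrated with a toy sketch. This is not the paper's grounding and pruning method, whose details are not given in this excerpt. It is a minimal illustration under stated assumptions: the 2-D embeddings, the item names, the `min_relevance` and `max_redundancy` thresholds, and the greedy filter itself are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query, items, k):
    """Similarity-only retrieval: rank items by cosine similarity to the query."""
    ranked = sorted(items, key=lambda it: cosine(query, it[1]), reverse=True)
    return ranked[:k]

def prune(query, candidates, min_relevance=0.5, max_redundancy=0.95):
    """Hypothetical greedy filter (an assumption, not the paper's method):
    drop candidates that are weakly related to the query or that nearly
    duplicate an item already kept."""
    kept = []
    for name, vec in candidates:
        if cosine(query, vec) < min_relevance:
            continue  # off-topic relative to the query
        if any(cosine(vec, kept_vec) > max_redundancy for _, kept_vec in kept):
            continue  # near-duplicate of something already kept
        kept.append((name, vec))
    return kept

# Toy 2-D "shared embedding space"; names and vectors are made up.
query = [1.0, 0.0]
items = [
    ("dog_bark_audio", [0.9, 0.1]),
    ("dog_bark_audio_copy", [0.9, 0.1]),
    ("dog_image", [0.7, 0.7]),
    ("car_horn_audio", [0.0, 1.0]),
]

top = retrieve_top_k(query, items, k=4)
kept = prune(query, top)
top_names = [name for name, _ in top]
kept_names = [name for name, _ in kept]
print(top_names)
print(kept_names)
```

With `k=4`, similarity-only retrieval returns all four items, including the duplicate and the unrelated one. The filter keeps only `dog_bark_audio` and `dog_image`.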
