MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
Xueyao Wan, Hang Yu

TL;DR
MMGraphRAG is a novel framework that combines visual scene graphs with text knowledge graphs using spectral clustering and path-based retrieval, enhancing multimodal reasoning and reducing hallucinations in language models.
Contribution
The paper introduces SpecLink, a spectral-clustering method for cross-modal entity linking, and releases the CMEL dataset for fine-grained multimodal entity alignment.
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Demonstrates robust domain adaptability.
Enhances multimodal information processing capabilities.
Abstract
Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios.…
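To make the spectral-clustering idea concrete, here is a minimal sketch of how an image entity might be grouped with candidate text entities. It is not SpecLink itself, whose details are not given in this summary: it is a generic two-way spectral partition based on the sign of the Fiedler vector of the graph Laplacian. The entity names, the similarity scores, and the helpers `build_weights`, `fiedler_vector`, and `link_candidates` are all hypothetical and chosen only for illustration.

```python
import math

def build_weights(names, edges):
    """Build a symmetric weight matrix from {(a, b): similarity} pairs."""
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    weights = [[0.0] * n for _ in range(n)]
    for (a, b), score in edges.items():
        weights[index[a]][index[b]] = score
        weights[index[b]][index[a]] = score
    return weights

def fiedler_vector(weights, iterations=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L = D - W.

    Uses power iteration on (shift * I - L) while projecting out the constant
    vector, which is the eigenvector of L for eigenvalue 0.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2.0 * max(degree)  # Gershgorin bound on the largest eigenvalue of L
    # Deterministic start vector that already sums to zero.
    vec = [i - (n - 1) / 2.0 for i in range(n)]
    for _ in range(iterations):
        new = []
        for i in range(n):
            # (L v)_i = degree_i * v_i - sum_j W_ij * v_j
            lap_v = degree[i] * vec[i] - sum(weights[i][j] * vec[j] for j in range(n))
            new.append(shift * vec[i] - lap_v)
        mean = sum(new) / n
        new = [x - mean for x in new]  # stay orthogonal to the constant vector
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec

def link_candidates(names, weights, query):
    """Return the nodes that fall on the same side of the spectral cut as `query`."""
    vec = fiedler_vector(weights)
    q = names.index(query)
    return [names[i] for i in range(len(names)) if i != q and vec[i] * vec[q] > 0]

# Hypothetical entities: one visual node and four text nodes (assumed scores).
names = ["img:dog", "txt:dog", "txt:puppy", "txt:car", "txt:road"]
edges = {
    ("img:dog", "txt:dog"): 0.9,
    ("img:dog", "txt:puppy"): 0.8,
    ("txt:dog", "txt:puppy"): 0.85,
    ("txt:car", "txt:road"): 0.9,
    ("txt:puppy", "txt:car"): 0.05,  # weak cross-cluster link
    ("img:dog", "txt:road"): 0.05,   # weak cross-cluster link
}
candidates = link_candidates(names, build_weights(names, edges), "img:dog")
print(candidates)
```

With these made-up scores the cut separates the dog-related nodes from the car and road nodes, so the image entity is left with two text candidates. A full pipeline would presumably choose among such candidates in a later step, which this sketch does not attempt to model.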
