KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie

TL;DR
KG-ViP is a unified framework that enhances multi-modal large language models for visual question answering by integrating scene graphs and commonsense graphs to improve reasoning and perception.
Contribution
KG-ViP introduces a retrieval-and-fusion pipeline that combines scene graphs and commonsense graphs to address two weaknesses of MLLMs in VQA: knowledge hallucination and insufficient fine-grained visual perception.
Findings
Significantly outperforms existing VQA methods on FVQA 2.0+ and MVQA benchmarks.
Effectively fuses external knowledge and visual details for better reasoning.
Demonstrates the synergy of combining scene graphs and commonsense graphs.
Abstract
Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive…
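To make the retrieval-and-fusion idea concrete, here is a minimal sketch of how a query could act as a bridge between the two graphs. It is not the authors' implementation: the triple representation, the exact-match retrieval rule, the helper names (`tokens`, `retrieve`, `fuse`), and the `[scene]` / `[knowledge]` output tags are all assumptions made for illustration. The sketch first selects scene-graph triples that mention a query term, then uses the entities in those triples to pull related commonsense triples, and finally serializes both into one structured context string that could be placed in an MLLM prompt.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a graph is a list of (head, relation, tail) triples.
Triple = Tuple[str, str, str]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into words, dropping simple punctuation."""
    return {w.strip("?.,!").lower() for w in text.split() if w.strip("?.,!")}


def retrieve(triples: List[Triple], anchors: Set[str]) -> List[Triple]:
    """Keep triples whose head or tail exactly matches one of the anchor terms."""
    return [t for t in triples if t[0].lower() in anchors or t[2].lower() in anchors]


def fuse(query: str, scene_graph: List[Triple], commonsense_graph: List[Triple]) -> str:
    """Build one structured context from both graphs, anchored on the query."""
    query_terms = tokens(query)

    # Step 1: scene triples that mention something the query asks about.
    scene_hits = retrieve(scene_graph, query_terms)

    # Step 2: widen the anchors with the entities found in those scene triples,
    # then look up commonsense triples about any of them.
    anchors = set(query_terms)
    for head, _, tail in scene_hits:
        anchors.update({head.lower(), tail.lower()})
    knowledge_hits = retrieve(commonsense_graph, anchors)

    # Step 3: serialize both sets of triples into a single tagged context.
    lines = ["[scene] " + " ".join(t) for t in scene_hits]
    lines += ["[knowledge] " + " ".join(t) for t in knowledge_hits]
    return "\n".join(lines)


# Toy data, invented for this example.
scene_graph = [("dog", "holds", "frisbee"), ("tree", "behind", "bench")]
commonsense_graph = [
    ("frisbee", "used_for", "playing catch"),
    ("bench", "used_for", "sitting"),
]

context = fuse("What is the dog holding?", scene_graph, commonsense_graph)
print(context)
```

In this toy run the query mentions only the dog, yet the commonsense fact about the frisbee is still retrieved because the scene graph links the two entities, while the unrelated bench triples are left out. A real system would presumably replace exact string matching with learned or embedding-based retrieval.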
