KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

Zhiyang Li; Ao Ke; Yukun Cao; Xike Xie

arXiv:2601.11632 · cs.CV · April 22, 2026

TL;DR

KG-ViP is a unified framework that enhances multi-modal large language models for visual question answering by integrating scene graphs and commonsense graphs to improve reasoning and perception.

Contribution

The paper introduces a retrieval-and-fusion pipeline that combines scene graphs and commonsense graphs to address knowledge hallucination and insufficient fine-grained visual perception in VQA.
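To make the idea of a query-driven retrieval-and-fusion step concrete, here is a minimal sketch. It is not the authors' implementation: the triple format, the word-overlap matching, the stopword list, and the names `tokens`, `retrieve`, `fuse`, and `to_context` are all assumptions chosen for illustration, and a real system would presumably use learned embeddings rather than word overlap. The sketch uses the query to select scene-graph triples, then uses the entities found there to pull matching commonsense triples, and finally serializes both into one structured context.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object); hypothetical format

# Hypothetical stopword list so that function words do not trigger matches.
STOPWORDS = {"what", "is", "the", "a", "an", "for", "used", "of", "in", "on"}


def tokens(text: str) -> Set[str]:
    """Lowercased content words; a toy stand-in for a semantic encoder."""
    words = (w.strip("?.,!").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def retrieve(terms: Set[str], graph: List[Triple]) -> List[Triple]:
    """Keep triples whose subject or object shares a word with `terms`."""
    return [t for t in graph if (tokens(t[0]) | tokens(t[2])) & terms]


def fuse(query: str, scene: List[Triple], commonsense: List[Triple]) -> List[Triple]:
    """Progressive retrieval: query -> scene graph -> commonsense graph."""
    bridge = tokens(query)
    scene_hits = retrieve(bridge, scene)
    # Entities grounded in the image extend the bridge into external knowledge.
    for subj, _, obj in scene_hits:
        bridge = bridge | tokens(subj) | tokens(obj)
    return scene_hits + retrieve(bridge, commonsense)


def to_context(triples: List[Triple]) -> str:
    """Serialize the fused triples as text that could be placed in a prompt."""
    return "\n".join(f"({s}, {r}, {o})" for s, r, o in triples)


# Toy data, invented for this example.
scene_graph = [("dog", "holding", "frisbee"), ("tree", "behind", "fence")]
commonsense_graph = [
    ("frisbee", "used for", "playing catch"),
    ("piano", "is a", "instrument"),
]
context = to_context(
    fuse("What is the dog holding used for?", scene_graph, commonsense_graph)
)
```

In this toy setup the query alone never mentions the frisbee; the commonsense triple about it is reachable only because the scene graph supplies that entity first, which is one way to read the "query as a semantic bridge" description.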

Findings

01. Significantly outperforms existing VQA methods on FVQA 2.0+ and MVQA benchmarks.
02. Effectively fuses external knowledge and visual details for better reasoning.
03. Demonstrates the synergy of combining scene graphs and commonsense graphs.

Abstract

Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive…
