SmoGVLM: A Small, Graph-enhanced Vision-Language Model

Debjyoti Mondal; Rituraj Singh; Subhadarshi Panda

arXiv:2604.16517·cs.CV·April 21, 2026

TL;DR

SmoGVLM is a small, graph-enhanced vision-language model that integrates structured knowledge to improve reasoning, outperforming larger models and baselines.

Contribution

The paper introduces SmoGVLM, a small VLM that uses graph neural networks to integrate structured knowledge, improving performance across a range of model sizes.

Findings

01

Small models trained with the SmoGVLM approach outperform larger VLMs and strong fine-tuned baselines.

02

The proposed method yields performance gains of up to 16.24%.

03

Structured knowledge augmentation benefits multimodal reasoning.

Abstract

Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that uses Graph Neural Networks to integrate structured knowledge with the visual and textual modalities. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and outperform both larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
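
The abstract does not describe the architecture in detail, so the following is only a minimal sketch of the general idea of graph-enhanced input, not the authors' implementation. It assumes a tiny knowledge graph with hand-written node features, one round of mean-aggregation message passing (a stand-in for a real GNN layer), mean pooling into a single "graph token", and simple concatenation with visual and text token embeddings. All function names, the fusion strategy, and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def message_pass(features: Dict[str, Vector],
                 edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing over undirected edges.

    Each node's new feature is the mean of its own feature and its
    neighbors' features. Hypothetical stand-in for a learned GNN layer.
    """
    neighbors: Dict[str, List[str]] = {node: [] for node in features}
    for src, dst in edges:
        neighbors[src].append(dst)
        neighbors[dst].append(src)

    updated: Dict[str, Vector] = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors[node]]
        updated[node] = [sum(col) / len(group) for col in zip(*group)]
    return updated


def pool_graph(features: Dict[str, Vector]) -> Vector:
    """Mean-pool all node features into one graph-level vector."""
    vectors = list(features.values())
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def build_input_sequence(graph_token: Vector,
                         visual_tokens: List[Vector],
                         text_tokens: List[Vector]) -> List[Vector]:
    """Assumed fusion: prepend the graph token to visual and text tokens."""
    return [graph_token] + visual_tokens + text_tokens


# Toy example: three concept nodes with 2-dimensional features.
node_features = {"dog": [1.0, 0.0], "animal": [0.0, 1.0], "leash": [1.0, 1.0]}
graph_edges = [("dog", "animal"), ("dog", "leash")]

graph_token = pool_graph(message_pass(node_features, graph_edges))
sequence = build_input_sequence(
    graph_token,
    visual_tokens=[[0.1, 0.2], [0.3, 0.4]],
    text_tokens=[[0.5, 0.6]],
)
```

In a real system the node features would come from a knowledge base, the aggregation would use learned weights, and the graph representation would be projected into the language model's embedding space. None of those details are stated in the abstract.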
