SmoGVLM: A Small, Graph-enhanced Vision-Language Model
Debjyoti Mondal, Rituraj Singh, Subhadarshi Panda

TL;DR
SmoGVLM is a small, graph-enhanced vision-language model that integrates structured knowledge through graph neural networks to improve reasoning, allowing a small model to outperform larger VLMs and strong fine-tuned baselines.
Contribution
The paper introduces SmoGVLM, a small VLM that uses graph neural networks to integrate structured knowledge with visual and textual inputs, and studies the method across model sizes from 1.3B to 13B parameters.
Findings
A small model trained with the SmoGVLM approach surpasses larger VLMs and strong fine-tuned baselines.
The proposed method yields performance gains of up to 16.24%.
Structured knowledge augmentation benefits multimodal reasoning.
Abstract
Large vision-language models (VLMs) achieve strong performance on multimodal tasks but often suffer from hallucination and poor grounding in knowledge-intensive reasoning. We propose SmoGVLM, a small, graph-enhanced VLM that uses Graph Neural Networks to integrate structured knowledge with the visual and textual modalities. We investigate the effects of our method across a range of model sizes, from tiny (1.3B) to large (13B) models. Our results demonstrate that, when trained using our approach, a small model can achieve performance gains of up to 16.24% and surpass its larger counterparts, outperforming both larger VLMs and strong fine-tuned baselines. These results highlight the potential of structured knowledge augmentation for efficient, smaller-scale multimodal reasoning systems.
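The abstract does not spell out how the graph component is wired into the model, so the following is only a generic illustration of the idea, not the authors' architecture. It is a minimal sketch assuming a toy knowledge graph, one round of mean-aggregation message passing (a common GNN building block), mean pooling into a single graph vector, and fusion by prepending that vector to the image and text token embeddings. All names (`mean_aggregate`, `pool_graph`, `fuse`) and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_aggregate(features: Dict[str, Vector],
                   edges: List[Tuple[str, str]]) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing on an undirected graph.

    Each node's new feature is the average of its own feature and the
    features of its neighbours (a self-loop is included).
    """
    neighbours = {node: [node] for node in features}
    for src, dst in edges:
        neighbours[src].append(dst)
        neighbours[dst].append(src)

    updated = {}
    for node, nbrs in neighbours.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(dim)
        ]
    return updated


def pool_graph(features: Dict[str, Vector]) -> Vector:
    """Mean-pool all node features into a single graph-level vector."""
    nodes = list(features.values())
    dim = len(nodes[0])
    return [sum(v[i] for v in nodes) / len(nodes) for i in range(dim)]


def fuse(graph_vec: Vector,
         image_tokens: List[Vector],
         text_tokens: List[Vector]) -> List[Vector]:
    """Assumed fusion scheme: prepend the graph vector as one extra token."""
    return [graph_vec] + image_tokens + text_tokens


# Hypothetical toy knowledge graph with 2-dimensional node features.
node_features = {"dog": [1.0, 0.0], "animal": [0.0, 1.0], "leash": [1.0, 1.0]}
graph_edges = [("dog", "animal"), ("dog", "leash")]

updated = mean_aggregate(node_features, graph_edges)
graph_vec = pool_graph(updated)

# Placeholder embeddings standing in for vision-encoder and tokenizer outputs.
image_tokens = [[0.1, 0.2], [0.3, 0.4]]
text_tokens = [[0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
sequence = fuse(graph_vec, image_tokens, text_tokens)
```

In this sketch the language model would consume `sequence` in place of the usual image-plus-text input, so structured knowledge reaches it through one additional embedding. A real system would use learned GNN weights and a projection into the language model's embedding space; both are omitted here.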
