GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning

Jiajin Liu; Dongzhe Fan; Chuanhao Ji; Daochen Zha; Qiaoyu Tan

arXiv:2603.13370 · cs.CV · March 17, 2026

Open Access

TL;DR

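GraphVLM is a benchmark for evaluating vision-language models (VLMs) on structured multimodal graph reasoning. It covers three integration paradigms (VLM-as-Encoder, VLM-as-Aligner, and VLM-as-Predictor) across multiple datasets, and its results point to VLMs as a promising foundation for this setting.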

Contribution

This work systematically benchmarks VLMs for multimodal graph learning, exploring three integration paradigms and revealing the effectiveness of VLMs as a foundation for structured multimodal reasoning.

Findings

01. VLMs improve multimodal graph learning across multiple datasets.

02. VLM-as-Predictor yields the best performance among paradigms.

03. The benchmark code is publicly available for further research.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured. To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space…
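To make the VLM-as-Encoder idea in the abstract concrete, here is a minimal sketch of multimodal feature fusion followed by one step of neighborhood aggregation. It is not the benchmark's implementation: the concatenation fusion, the mean aggregator, the function names `fuse_features` and `mean_aggregate`, and the toy embedding values are all assumptions chosen for illustration. In practice the per-node embeddings would come from a VLM and the aggregation from a trained graph neural network.

```python
from typing import Dict, List

Vector = List[float]


def fuse_features(text_emb: Vector, image_emb: Vector) -> Vector:
    """Fuse two modality embeddings by concatenation (one simple, assumed choice)."""
    return list(text_emb) + list(image_emb)


def mean_aggregate(
    features: Dict[int, Vector], neighbors: Dict[int, List[int]]
) -> Dict[int, Vector]:
    """Average each node's feature vector with those of its neighbors (self-loop included)."""
    out: Dict[int, Vector] = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in neighbors.get(node, [])]
        out[node] = [
            sum(vec[i] for vec in group) / len(group) for i in range(len(feat))
        ]
    return out


# Toy embeddings standing in for VLM outputs (hypothetical values).
text_embs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
image_embs = {0: [2.0], 1: [4.0], 2: [6.0]}

# Undirected toy graph as an adjacency list: node 0 links to nodes 1 and 2.
edges = {0: [1, 2], 1: [0], 2: [0]}

fused = {n: fuse_features(text_embs[n], image_embs[n]) for n in text_embs}
smoothed = mean_aggregate(fused, edges)

print(smoothed[1])
```

Concatenation keeps the two modalities in separate coordinates, so a downstream model can weight them independently. Other fusion schemes (summation, gating, attention) would slot into `fuse_features` without changing the aggregation step.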

Taxonomy

Topics: Advanced Graph Neural Networks · Multimodal Machine Learning Applications · Topic Modeling