Mario: Multimodal Graph Reasoning with Large Language Models
Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan

TL;DR
Mario is a framework for multimodal graph reasoning that uses large language models to reason over relationships in graph-structured image-text data, addressing weak cross-modal consistency and heterogeneous modality preference.
Contribution
The paper presents a unified approach combining graph-conditioned vision-language modeling and modality-adaptive instruction tuning for improved multimodal graph reasoning.
Findings
Mario outperforms state-of-the-art models in node classification.
Mario achieves superior results in link prediction tasks.
The framework is effective in both supervised and zero-shot scenarios.
Abstract
Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address these challenges, we propose Mario, a unified framework that resolves both simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM…
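As a rough illustration of the multimodal graph (MMG) described in the abstract, where each node has textual and visual attributes and edges provide structural cues, the data structure can be sketched as follows. This is a minimal sketch and not Mario's implementation: the names `MMGNode`, `MultimodalGraph`, `image_path`, and the choice of undirected edges are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MMGNode:
    """One node of a multimodal graph (hypothetical layout)."""
    node_id: int
    text: str          # textual attribute of the node
    image_path: str    # visual attribute, stored here as a file path (assumption)


@dataclass
class MultimodalGraph:
    """Nodes with text and image attributes, plus undirected edges (assumption)."""
    nodes: Dict[int, MMGNode] = field(default_factory=dict)
    adjacency: Dict[int, Set[int]] = field(default_factory=dict)

    def add_node(self, node: MMGNode) -> None:
        self.nodes[node.node_id] = node
        self.adjacency.setdefault(node.node_id, set())

    def add_edge(self, u: int, v: int) -> None:
        # Both endpoints must already exist; edges are stored symmetrically.
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.adjacency[u].add(v)
        self.adjacency[v].add(u)

    def neighbors(self, node_id: int) -> List[MMGNode]:
        # Return neighbor nodes sorted by id so the order is deterministic.
        return [self.nodes[n] for n in sorted(self.adjacency[node_id])]


# Toy example with made-up product nodes.
graph = MultimodalGraph()
graph.add_node(MMGNode(0, "red running shoes", "img/0.jpg"))
graph.add_node(MMGNode(1, "blue running shoes", "img/1.jpg"))
graph.add_node(MMGNode(2, "wool winter hat", "img/2.jpg"))
graph.add_edge(0, 1)
graph.add_edge(0, 2)
```

Encoding each image-text pair in isolation would use only `text` and `image_path`; the structural cues the abstract refers to correspond to what `graph.neighbors(0)` returns in this sketch.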
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
