SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

Chuhan Wang; Xintong Li; Jennifer Yuntong Zhang; Junda Wu; Chengkai Huang; Lina Yao; Julian McAuley; Jingbo Shang

arXiv:2601.05600·cs.CV·January 12, 2026

Open Access

TL;DR

SceneAlign enhances multimodal reasoning in complex visual scenes by aligning language models with scene graphs, reducing hallucinations and improving answer accuracy through structured visual grounding and contrastive training.

Contribution

SceneAlign introduces a framework that combines controllable structural interventions on scene graphs with contrastive training to improve reasoning faithfulness in multimodal models.

Findings

01 Consistently improves answer accuracy across seven benchmarks.

02 Reduces hallucinated entities and mis-grounded relations.

03 Enhances reasoning faithfulness in complex visual scenes.

Abstract

Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that…
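
The idea of perturbing a reasoning-critical scene-graph node to build a hard negative rationale can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the four strategies, so the sketch assumes one perturbation per failure mode listed above (hallucinated entity, mis-grounded relation, skipped step, over-specified reasoning). The `Triple` type, the function names, and the sentence template are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One scene-graph edge used as a reasoning step (hypothetical type)."""
    subject: str
    relation: str
    obj: str


def verbalize(steps):
    """Turn scene-graph triples into a plain-text rationale."""
    return [f"The {t.subject} is {t.relation} the {t.obj}." for t in steps]


def hallucinate_entity(steps, idx, distractors, rng):
    """Assumed strategy 1: replace the object with an entity not in the step."""
    t = steps[idx]
    fake = rng.choice([d for d in distractors if d != t.obj])
    return steps[:idx] + [Triple(t.subject, t.relation, fake)] + steps[idx + 1:]


def misground_relation(steps, idx, relations, rng):
    """Assumed strategy 2: keep both entities but swap in a wrong relation."""
    t = steps[idx]
    wrong = rng.choice([r for r in relations if r != t.relation])
    return steps[:idx] + [Triple(t.subject, wrong, t.obj)] + steps[idx + 1:]


def skip_step(steps, idx):
    """Assumed strategy 3: drop the critical step from the rationale."""
    return steps[:idx] + steps[idx + 1:]


def over_specify(steps, idx, extra):
    """Assumed strategy 4: insert an unneeded detail after the critical step."""
    return steps[:idx + 1] + [extra] + steps[idx + 1:]


def build_pair(steps, negative_steps):
    """Pair the grounded rationale with its perturbed (rejected) counterpart."""
    return {"chosen": verbalize(steps), "rejected": verbalize(negative_steps)}


if __name__ == "__main__":
    rng = random.Random(0)
    steps = [Triple("cup", "on", "table"), Triple("table", "next to", "window")]
    negative = misground_relation(steps, 0, ["on", "under", "behind"], rng)
    for name, rationale in build_pair(steps, negative).items():
        print(name, rationale)
```

Each function returns a new list and leaves the original rationale untouched, so the same grounded rationale can be paired with several negatives. How the critical node index is chosen, and how such pairs feed into training, is not covered by this sketch.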

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks