Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning

Dayong Liang; Changmeng Zheng; Zhiyuan Wen; Yi Cai; Xiao-Yong Wei; Qing Li

arXiv:2505.09118 · cs.CV · May 15, 2025

Open Access

TL;DR

This paper introduces ISGR, a novel framework that enhances vision-language models' ability to reason about complex interactions in visual scenes by combining spatial relation extraction, interaction queries, and memory reinforcement learning.

Contribution

We propose Interaction-augmented Scene Graph Reasoning (ISGR), integrating spatial, interaction-aware, and memory components to improve scene understanding and reasoning in vision-language models.

Findings

01. Significantly outperforms baseline methods on interaction-heavy reasoning benchmarks.

02. Achieves strong improvements on complex scene understanding tasks.

03. Demonstrates effective long-term interaction reasoning capabilities.

Abstract

Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components. First, our dual-stream graph constructor combines SAM-powered spatial relation extraction with interaction-aware captioning to generate functionally salient scene graphs with spatial grounding. Second, we employ targeted interaction queries to activate VLMs' latent knowledge of object…
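
As a rough illustration of the dual-stream idea described above, the sketch below merges spatial triples and interaction triples into a single scene graph. It is not the authors' implementation: the names `Edge`, `SceneGraph`, and `build_graph` are hypothetical, the triples are hand-written stand-ins for the outputs of the segmentation and captioning streams, and the rule that an interaction edge is kept only when both endpoints already appear in the spatial stream is an assumption made here to mimic "spatial grounding".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """One relation in the graph (hypothetical structure, not from the paper)."""
    subject: str
    predicate: str
    obj: str
    kind: str  # "spatial" or "interaction"


@dataclass
class SceneGraph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add(self, edge: Edge) -> None:
        # Register both endpoints and skip exact duplicate edges.
        self.nodes.update((edge.subject, edge.obj))
        if edge not in self.edges:
            self.edges.append(edge)


def build_graph(spatial_triples, interaction_triples) -> SceneGraph:
    """Merge two streams of (subject, predicate, object) triples."""
    graph = SceneGraph()
    for s, p, o in spatial_triples:
        graph.add(Edge(s, p, o, "spatial"))
    for s, p, o in interaction_triples:
        # Assumption: keep an interaction only if both objects were
        # already grounded by the spatial stream.
        if s in graph.nodes and o in graph.nodes:
            graph.add(Edge(s, p, o, "interaction"))
    return graph


# Hand-written example triples standing in for model outputs.
spatial = [("person", "left of", "horse"), ("horse", "on", "grass")]
interaction = [("person", "feeding", "horse"), ("person", "holding", "bucket")]

graph = build_graph(spatial, interaction)
for edge in graph.edges:
    print(f"[{edge.kind}] {edge.subject} --{edge.predicate}--> {edge.obj}")
```

In this toy input, the "person holding bucket" relation is dropped because "bucket" never appears in the spatial stream; a real system might instead try to ground the missing object rather than discard the relation.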

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Focus