SIFThinker: Spatially-Aware Image Focus for Visual Reasoning

Zhangquan Chen; Ruihui Zhao; Chuwei Luo; Mingze Sun; Xinlei Yu; Yangyang Kang; Ruqi Huang

arXiv:2508.06259·cs.CV·December 29, 2025

TL;DR

SIFThinker introduces a spatially-aware framework for visual reasoning that enhances attention correction and region focusing using depth-enhanced cues, leading to improved spatial understanding in multimodal models.

Contribution

The paper presents a reverse-expansion-forward-inference strategy and a reinforced training paradigm that incorporate depth-informed visual grounding for better spatial reasoning.

Findings

01. Outperforms state-of-the-art methods on spatial understanding tasks

02. Constructs the SIF-50K dataset for process-level supervision

03. Demonstrates strong generalization capabilities

Abstract

Current multimodal large language models (MLLMs) still face significant challenges in complex visual tasks (e.g., spatial understanding, fine-grained perception). Prior methods have tried to incorporate visual reasoning; however, they fail to leverage attention correction with spatial cues to iteratively refine their focus on prompt-relevant regions. In this paper, we introduce SIFThinker, a spatially-aware "think-with-images" framework that mimics human visual perception. Specifically, SIFThinker enables attention correction and image region focusing by interleaving depth-enhanced bounding boxes and natural language. Our contributions are twofold: First, we introduce a reverse-expansion-forward-inference strategy that facilitates the generation of interleaved image-text chains of thought for process-level supervision, which in turn leads to the construction of the SIF-50K dataset.…
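To make the idea of "interleaving depth-enhanced bounding boxes and natural language" concrete, here is a minimal sketch of one way such a chain could be represented. It is an illustration only, not the paper's implementation: the `ReasoningStep` class, the `mean_depth`, `focus`, and `render` helpers, the `<box>` tag format, and the use of a box-averaged depth value are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical box convention: (x1, y1, x2, y2) in pixels, x2/y2 exclusive.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningStep:
    """One step of an interleaved chain: text, optionally tied to a region."""
    text: str
    box: Optional[Box] = None
    depth: Optional[float] = None  # assumed summary: mean depth inside the box


def mean_depth(depth_map: Sequence[Sequence[float]], box: Box) -> float:
    """Average the depth values that fall inside the box."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not values:
        raise ValueError("box covers no pixels")
    return sum(values) / len(values)


def focus(text: str, depth_map: Sequence[Sequence[float]], box: Box) -> ReasoningStep:
    """Build a step whose region carries a depth cue taken from the depth map."""
    return ReasoningStep(text=text, box=box, depth=mean_depth(depth_map, box))


def render(steps: List[ReasoningStep]) -> str:
    """Serialize the chain, placing each region tag before its text."""
    parts: List[str] = []
    for step in steps:
        if step.box is not None:
            x1, y1, x2, y2 = step.box
            parts.append(f"<box>[{x1},{y1},{x2},{y2}] depth={step.depth:.2f}</box>")
        parts.append(step.text)
    return " ".join(parts)


if __name__ == "__main__":
    # Toy 2x4 depth map: left half near (1.0), right half far (4.0).
    depth_map = [[1.0, 1.0, 4.0, 4.0], [1.0, 1.0, 4.0, 4.0]]
    chain = [
        focus("Start from the whole image.", depth_map, (0, 0, 4, 2)),
        focus("Narrow the focus to the far region.", depth_map, (2, 0, 4, 2)),
    ]
    print(render(chain))
```

In this sketch, the second step stands in for a corrected focus: the region shrinks from the full image to the right half, and the attached depth value changes accordingly. How SIFThinker actually encodes depth and boxes in its outputs is not specified in the text above.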

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging