SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning

Yang Liu; Ming Ma; Xiaomin Yu; Pengxiang Ding; Han Zhao; Mingyang Sun; Siteng Huang; Donglin Wang

arXiv:2505.12448·cs.CV·October 27, 2025



TL;DR

This paper introduces SSR, a framework that converts raw depth data into textual rationales to improve spatial reasoning in vision-language models, supported by a new dataset and benchmark.

Contribution

The paper presents a novel method for transforming depth data into textual rationales and a knowledge distillation approach for efficient integration into existing models.

Findings

01. SSR significantly improves depth utilization in VLMs.

02. The SSR-CoT dataset enables comprehensive spatial reasoning evaluation.

03. Experiments show enhanced multi-modal understanding with SSR.

Abstract

Despite impressive advancements in Visual-Language Models (VLMs) for multi-modal tasks, their reliance on RGB inputs limits precise spatial understanding. Existing methods for integrating spatial cues, such as point clouds or depth, either require specialized sensors or fail to effectively exploit depth information for higher-order reasoning. To this end, we propose Spatial Sense and Reasoning (SSR), a novel framework that transforms raw depth data into structured, interpretable textual rationales. These textual rationales serve as meaningful intermediate representations that significantly enhance spatial reasoning capabilities. Additionally, we leverage knowledge distillation to compress the generated rationales into compact latent embeddings, which facilitate resource-efficient, plug-and-play integration into existing VLMs without retraining. To enable…


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization

Methods: Knowledge Distillation