SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning

Byungwoo Jeon; Dongyoung Kim; Huiwon Jang; Insoo Kim; Jinwoo Shin

arXiv:2603.22057·cs.CV·March 24, 2026

Open Access · 3 Reviews

TL;DR

SpatialBoost enhances 2D vision models by integrating 3D spatial knowledge via language, significantly improving their performance on perception benchmarks through a hierarchical reasoning process.

Contribution

It introduces a scalable framework that injects 3D spatial understanding into pre-trained vision encoders using language and multi-turn reasoning.

Findings

01. Improves DINOv3 mIoU on ADE20K from 55.9 to 59.7.

02. Achieves state-of-the-art performance, a gain of 3.8 mIoU points over the DINOv3 baseline.

03. Demonstrates effectiveness across various vision benchmarks.
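The 3.8 gain reported above matches the absolute difference between the two ADE20K mIoU values, not a relative improvement. The short check below makes the distinction explicit; the function names are illustrative and not taken from the paper.

```python
# Figures reported in the findings (ADE20K mIoU for DINOv3).
baseline_miou = 55.9
boosted_miou = 59.7


def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points (after minus before)."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain expressed as a percentage of the baseline value."""
    return round(100.0 * (after - before) / before, 1)


abs_gain = absolute_gain(baseline_miou, boosted_miou)
rel_gain = relative_gain(baseline_miou, boosted_miou)
print(f"absolute: {abs_gain} points, relative: {rel_gain}%")
```

Read this way, the reported gain is 3.8 mIoU points, which corresponds to roughly a 6.8% relative improvement over the baseline.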

Abstract

Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea is to convert dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively…
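To make the idea of expressing dense 3D information as language more concrete, here is a minimal sketch, not the authors' pipeline. It assumes a toy depth map in meters and hand-written object boxes, and it builds a three-turn dialogue that moves from a pixel-level question to an object-level one and then a scene-level one. The question templates, the turn ordering, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Depth = List[List[float]]        # toy depth map in meters, row-major
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive
Turn = Tuple[str, str]           # (question, answer)


def mean_depth(depth: Depth, box: Box) -> float:
    """Average depth inside a box; a stand-in for a real object mask."""
    r0, c0, r1, c1 = box
    vals = [depth[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(vals) / len(vals)


def pixel_qa(depth: Depth, r: int, c: int) -> Turn:
    """Pixel-level turn: describe the depth at a single location."""
    q = f"How far is the point at (row {r}, col {c}) from the camera?"
    return q, f"About {depth[r][c]:.1f} meters."


def object_qa(depth: Depth, objects: Dict[str, Box], a: str, b: str) -> Turn:
    """Object-level turn: compare two objects by mean depth."""
    da, db = mean_depth(depth, objects[a]), mean_depth(depth, objects[b])
    closer = a if da < db else b
    q = f"Which is closer to the camera, the {a} or the {b}?"
    return q, f"The {closer} is closer ({da:.1f} m vs {db:.1f} m)."


def scene_qa(depth: Depth, objects: Dict[str, Box]) -> Turn:
    """Scene-level turn: order all objects from nearest to farthest."""
    order = sorted(objects, key=lambda name: mean_depth(depth, objects[name]))
    return "List the objects from nearest to farthest.", ", ".join(order) + "."


def build_dialogue(depth: Depth, objects: Dict[str, Box]) -> List[Turn]:
    """Assemble the turns; the pixel -> object -> scene order is an assumption."""
    names = list(objects)
    return [
        pixel_qa(depth, 0, 2),
        object_qa(depth, objects, names[0], names[1]),
        scene_qa(depth, objects),
    ]


# Toy inputs: a 2x4 depth map with a near "chair" and a far "table".
depth = [
    [1.0, 1.0, 4.0, 4.0],
    [1.0, 1.0, 4.0, 4.0],
]
objects = {"chair": (0, 0, 2, 2), "table": (0, 2, 2, 4)}
dialogue = build_dialogue(depth, objects)
for question, answer in dialogue:
    print("Q:", question)
    print("A:", answer)
```

In a setup like the one the abstract describes, text of this kind would serve as the supervision signal passed through an LLM to the vision encoder; how the paper actually generates and structures its data is not specified in the excerpt above.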

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The main idea is interesting and clear: using language-based supervision to improve spatial understanding in vision encoders.
* The method is easy to apply on top of existing pre-trained models, which makes it practical and useful for the community.
* The experiments are extensive, covering many benchmarks involving both 3D perception and general vision tasks.
* The results suggest that language supervision can transfer structured spatial knowledge into dense prediction tasks, which is an int

Weaknesses

* Although the hypothesis is interesting, it is not fully supported by the results. In Table 6, the LLM-based fine-tuning appears roughly on par with some dense prediction baselines, so the advantage of the language-based supervision is not clearly demonstrated.
* The comparison in Table 6 is also difficult to interpret because the baselines are not trained on the same data. A fair comparison on the exact same subset and the same amount of data would make the claim stronger.
* Many performance gains i

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The work is motivated by a well-identified gap in 2D vision encoders: their inherent lack of 3D spatial understanding. The proposed approach of converting 3D information into linguistic expressions is innovative.

Weaknesses

- The spatial CoT data is generated using GPT-4o. There is no evaluation of the quality of this synthetic data. GPT-4o may suffer from significant hallucinations on spatial-related questions, which could compromise data reliability.
- Experiments are conducted only on vision encoders like DINOv2, OpenCLIP, and the Qwen2.0-7B LLM. The method lacks comparison with recent state-of-the-art MLLMs such as Qwen2.5-VL or InternVL3. It remains unclear whether the approach would be effective wh

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The proposed method of using LLM supervision to enhance the spatial information in visual representations is simple but effective. The motivation is reasonable, and the solution is natural and intuitive.
- After applying the feature fine-tuning method SpatialBoost, the visual representations show universally improved performance on various types of tasks, including dense prediction, 3D understanding, robot learning, and even high-level tasks like image classification and retrieval. It is very c

Weaknesses

- I think it is mostly a well-written paper with effective solutions and comprehensive experimental evaluations. One thing that I find a bit confusing is the categorization of the multi-turn spatial reasoning task. In Figure 2, there is a pyramid showing three levels of spatial knowledge: scene-level QA, object-level QA, and pixel-level QA. However, I do not quite agree that the corresponding examples on the right (the three QA examples in Figure 2 correspond to the left using the color ma

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · 3D Shape Modeling and Analysis