SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin

TL;DR
SpatialBoost enhances 2D vision models by integrating 3D spatial knowledge via language, significantly improving their performance on perception benchmarks through a hierarchical reasoning process.
Contribution
It introduces a scalable framework that injects 3D spatial understanding into pre-trained vision encoders using language and multi-turn reasoning.
Findings
Improves DINOv3 mIoU on ADE20K from 55.9 to 59.7.
Achieves state-of-the-art performance with a 3.8-point mIoU gain.
Demonstrates effectiveness across various vision benchmarks.
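As a quick check on the figures above, the following minimal sketch assumes both mIoU values are percentages on the same scale. Under that assumption the reported 3.8 is an absolute difference in mIoU points, which works out to a relative improvement of roughly 6.8% over the baseline. The helper name `miou_gain` is hypothetical and used only for illustration.

```python
def miou_gain(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = improved - baseline
    relative = absolute / baseline * 100.0
    return round(absolute, 1), round(relative, 1)


# DINOv3 on ADE20K, values taken from the findings above.
abs_gain, rel_gain = miou_gain(55.9, 59.7)
print(f"absolute gain: {abs_gain} points, relative gain: {rel_gain}%")
```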
Abstract
Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively…
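To make the idea of turning dense 3D information into language more concrete, here is a small illustrative sketch. It is not the authors' pipeline: the `Box` type, the `build_dialogue` function, the question templates, the use of a metric depth map in meters, and the pixel-to-object-to-scene ordering of turns are all assumptions made for this example. It only shows how a depth map plus object boxes could be rendered as a multi-turn question-answer sequence that moves from fine to coarse spatial facts.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical object annotation; x1 and y1 are exclusive bounds."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int


def mean_depth(depth: list[list[float]], box: Box) -> float:
    """Average depth (assumed to be in meters) over the pixels inside a box."""
    values = [depth[y][x] for y in range(box.y0, box.y1) for x in range(box.x0, box.x1)]
    return sum(values) / len(values)


def build_dialogue(depth, boxes, pixel):
    """Build (question, answer) turns: pixel level, then object level, then scene level."""
    px, py = pixel
    turns = [(
        f"How far from the camera is the pixel at ({px}, {py})?",
        f"About {depth[py][px]:.1f} m.",
    )]

    # Object level: compare the nearest and farthest objects by mean depth.
    ordered = sorted(boxes, key=lambda b: mean_depth(depth, b))
    near, far = ordered[0], ordered[-1]
    turns.append((
        f"Which is closer to the camera, the {near.name} or the {far.name}?",
        f"The {near.name} (about {mean_depth(depth, near):.1f} m "
        f"versus {mean_depth(depth, far):.1f} m).",
    ))

    # Scene level: summarize the overall near-to-far layout.
    turns.append((
        "Describe the scene layout from near to far.",
        "From near to far: " + ", ".join(b.name for b in ordered) + ".",
    ))
    return turns


# Toy 4x4 depth map: a near region on the left, a far region on the right.
depth = [
    [1.0, 1.0, 4.0, 4.0],
    [1.0, 1.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
boxes = [Box("table", 2, 0, 4, 4), Box("chair", 0, 0, 2, 2)]
dialogue = build_dialogue(depth, boxes, pixel=(0, 2))
for question, answer in dialogue:
    print("Q:", question)
    print("A:", answer)
```

In a setup like the one the abstract describes, text of this kind would serve as the supervision signal passed through the LLM; how the depth and object annotations are actually obtained and phrased is not specified in this excerpt.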
Peer Reviews
Decision · Submitted to ICLR 2026
* The main idea is interesting and clear: using language-based supervision to improve spatial understanding in vision encoders.
* The method is easy to apply on top of existing pre-trained models, which makes it practical and useful for the community.
* The experiments are extensive, covering many benchmarks involving both 3D perception and general vision tasks.
* The results suggest that language supervision can transfer structured spatial knowledge into dense prediction tasks, which is an int
* Although the hypothesis is interesting, it is not fully supported by the results. In Table 6, the LLM-based fine-tuning appears roughly on par with some dense prediction baselines, so the advantage of the language-based supervision is not clearly demonstrated.
* The comparison in Table 6 is also difficult to interpret because the baselines are not trained on the same data. A fair comparison on the exact same subset & same amount of data would make the claim stronger.
* Many performance gains i
The work is motivated by a well-identified gap in 2D vision encoders—their inherent lack of 3D spatial understanding. The proposed approach of converting 3D information into linguistic expressions is innovative.
- The spatial CoT data is generated using GPT-4o. There is no evaluation of the quality of this synthetic data. GPT-4o may suffer from significant hallucinations on spatial-related questions, which could compromise data reliability.
- Experiments are conducted only on vision encoders like DINOv2, OpenCLIP, and the Qwen2.0-7B LLM. The method lacks comparison with recent state-of-the-art MLLMs such as Qwen2.5-VL or InternVL3. It remains unclear whether the approach would be effective wh
- The proposed method of using LLM supervision to enhance the spatial information in visual representations is simple but effective. The motivation is reasonable, and the solution is natural and intuitive.
- After applying the feature fine-tuning method SpatialBoost, the visual representations show universally improved performance on various types of tasks, including dense prediction, 3D understanding, robot learning, and even high-level tasks like image classification and retrieval. It is very c
- I think it is mostly a well-written paper with effective solutions and comprehensive experimental evaluations. One thing that I find a bit confusing is the categorization of the multi-turn spatial reasoning task. In Figure 2, there is a pyramid showing three levels of spatial knowledge: scene-level QA, object-level QA, and pixel-level QA. However, I do not quite agree that the corresponding examples on the right (the three QA examples in Figure 2 correspond to the left using the color ma
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · 3D Shape Modeling and Analysis
