VISOR: VIsual Spatial Object Reasoning for Language-driven Object Navigation
Francesco Taioli, Shiping Yang, Sonia Raychaudhuri, Marco Cristani, Unnat Jain, Angel X Chang

TL;DR
This paper introduces VISOR, a compact vision-language-action agent that performs explicit, image-grounded reasoning for language-driven object navigation, improving explainability and generalization over existing methods.
Contribution
The paper presents a novel 3B-parameter VLA agent that replaces multi-model pipelines with explicit reasoning stages for better interpretability and efficiency in object navigation.
Findings
Enhanced explainability through explicit reasoning stages
Improved generalization to unseen objects and environments
More efficient navigation compared to prior methods
Abstract
Language-driven object navigation requires agents to interpret natural language descriptions of target objects, which combine intrinsic and extrinsic attributes for instance recognition and commonsense navigation. Existing methods either (i) use end-to-end trained models with vision-language embeddings, which struggle to generalize beyond training data and lack action-level explainability, or (ii) rely on modular zero-shot pipelines with large language models (LLMs) and open-set object detectors, which suffer from error propagation, high computational cost, and difficulty integrating their reasoning back into the navigation policy. To this end, we propose a compact 3B-parameter Vision-Language-Action (VLA) agent that performs human-like embodied reasoning for both object recognition and action selection, removing the need for stitched multi-model pipelines. Instead of raw embedding…
Peer Reviews
Decision · Submitted to ICLR 2026
- **Unified and Compact Model:** VISOR is implemented as a single, end-to-end model with 3B parameters, removing the need for large, external object detectors or segmented multi-model pipelines. This directly addresses one of the field's notable practical bottlenecks.
- **Explicit Reasoning and Explainability:** The agent generates detailed reasoning traces (`<think>`, `<think_summary>`, and `<action>`), providing action-level explainability, an advance over black-box action selection models.
- *
1. **Empirical Performance Lagging on Seen Categories:**
   - VISOR persistently lags behind the strongest baseline methods (e.g., RL, DAgRL, Uni-NaVid, and MTU3D) in raw performance measures (SPL, SR) on the Val Seen and Synonym splits (see Table 3). For example, on OVON Val Seen, VISOR (GSPO) achieves SPL=12.48 / SR=21.7 vs. DAgRL's SPL=21.2 / SR=41.3 and MTU3D's SPL=23.6 / SR=55.0, indicating that its effectiveness in familiar settings is limited.
   - The rationale for the lower numbers, richer,
- Building an intelligent navigation model is a long-standing goal for the VLM and embodied AI communities. This paper proposes the CURE properties that VISOR possesses: compact, unified, reasoning-capable, and explainable.
- Endowing navigation models with thinking capability is a good practice. It facilitates learning the navigation task and enhances explainability, and the authors use this advantage to provide representative qualitative studies and discussions.
- This paper coll
- I have some concerns about the input modality that make the comparison unfair. VISOR takes a BEV image as input, which provides global information and could make the navigation task easier.
- The results are not strong enough compared to recent navigation models such as Uni-NaVid and MTU3D. For example, using DINO features should not be an excuse, as VISOR in turn leverages the pretrained Qwen2.5-VL model.
- Oracle stop can be used for analysis but should not be used for fair compar
1. The approach makes it easy for practitioners to understand the model's outputs and the reasons behind them via structured reasoning tags.
2. The paper introduces waypoint-level supervision to support training reasoning-capable embodied navigation agents, addressing a gap not covered by existing datasets.
3. The method and experimental conclusions are described clearly, making the overall contribution easy to follow.
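The structured reasoning tags the reviews mention (`<think>`, `<think_summary>`, `<action>`) lend themselves to simple programmatic inspection. The following is a minimal Python sketch, not the authors' code. It assumes each tag is closed by a matching `</...>` tag (the excerpts show only the opening tags), and the sample trace, including the action string `move_forward`, is made up for illustration.

```python
import re
from typing import Dict, Optional

# Tag names taken from the review excerpts; closing tags are an assumption.
TAGS = ("think", "think_summary", "action")


def parse_trace(text: str) -> Dict[str, Optional[str]]:
    """Return the content of each reasoning tag, or None if the tag is absent."""
    parsed: Dict[str, Optional[str]] = {}
    for tag in TAGS:
        # Non-greedy match so only the first <tag>...</tag> pair is captured;
        # DOTALL lets the reasoning text span multiple lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed


# Hypothetical model output, for illustration only.
sample = (
    "<think>The description mentions a chair next to a window. "
    "A window is visible on the left.</think>"
    "<think_summary>Target is probably to the left.</think_summary>"
    "<action>move_forward</action>"
)
trace = parse_trace(sample)
```

Here `trace["action"]` holds the string a controller would execute, while `trace["think"]` and `trace["think_summary"]` can be logged for inspection. A missing tag yields `None` rather than raising, so malformed outputs can be counted separately.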
1. The paper should clearly argue, ideally with ablations, the necessity of using three cameras/panoramic inputs for the proposed method.
2. The absolute performance lags SOTA. On OVON, for example, *Val Seen* SR: DAgRL 41.3 vs. VISOR-GSPO 21.7; *Val Unseen* SR: MTU3D 40.8 vs. VISOR-GSPO 22.0. On CoIN-Bench, VISOR-GSPO improves over SFT but SR remains modest overall.
3. The paper should quantify the contributions of the `<think>` and `<think_summary>` components to final performance, and provide d
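The reviews compare methods on SR (success rate) and SPL (success weighted by path length). As a reminder of what these numbers measure, here is a short Python sketch using the standard SPL definition, in which each successful episode contributes the ratio of the shortest path length to the longer of the agent's path and the shortest path. Whether the paper's evaluation matches this definition exactly is an assumption. The `Episode` class and its values are illustrative, not data from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    success: bool          # did the agent stop at the target?
    shortest_path: float   # geodesic start-to-goal distance, assumed > 0
    agent_path: float      # length of the path the agent actually took


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # Ratio is 1.0 for an optimal path and shrinks as the agent detours.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Illustrative episodes (not results from the paper).
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),   # contributes 0.5
    Episode(success=False, shortest_path=4.0, agent_path=8.0),   # contributes 0.0
    Episode(success=True, shortest_path=3.0, agent_path=3.0),    # contributes 1.0
]
sr_value = success_rate(episodes)
spl_value = spl(episodes)
```

Both functions return fractions in [0, 1]; the figures quoted in the reviews appear to be the same quantities expressed as percentages. Because failed episodes contribute zero, SPL can never exceed SR, which is consistent with the SPL/SR pairs quoted above.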
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
