Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation

Shuo Wang; Yongcai Wang; Wanting Li; Xudong Cai; Yucheng Wang; Maiyue Chen; Kaihui Wang; Zhizhong Su; Deying Li; Zhaoxin Fan

arXiv:2505.11886·cs.RO·October 15, 2025

TL;DR

This paper investigates reasoning strategies in vision-language navigation, identifies inference-time reasoning issues, and proposes Aux-Think, a framework that improves data efficiency and performance by training models with structured reasoning supervision.

Contribution

It introduces Aux-Think, a novel training framework that internalizes reasoning patterns for VLN, and provides the first Chain-of-Thought dataset for this task.

Findings

01. Inference-time reasoning degrades navigation accuracy.

02. Aux-Think reduces training effort significantly.

03. Aux-Think achieves state-of-the-art performance with less data.

Abstract

Vision-Language Navigation (VLN) is a critical task for developing embodied agents that can follow natural language instructions to navigate in complex real-world environments. Recent advances in VLN by large pretrained models have significantly improved generalization and instruction grounding compared to traditional approaches. However, the role of reasoning strategies in navigation (an action-centric, long-horizon task) remains underexplored, despite Chain-of-Thought (CoT) reasoning's demonstrated success in static tasks like visual question answering. To address this gap, we conduct the first systematic evaluation of reasoning strategies for VLN, including No-Think (direct action prediction), Pre-Think (reason before action), and Post-Think (reason after action). Surprisingly, our findings reveal the Inference-time Reasoning Collapse issue, where inference-time reasoning degrades…
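
The three strategies named in the abstract differ only in whether reasoning text is produced and where it sits relative to the action. The sketch below illustrates that ordering, not the authors' implementation: the `Strategy` enum, the `format_output` function, and the `[THINK]`/`[ACTION]` markers are all hypothetical names chosen for this example.

```python
from enum import Enum


class Strategy(Enum):
    NO_THINK = "no_think"      # direct action prediction
    PRE_THINK = "pre_think"    # reason before action
    POST_THINK = "post_think"  # reason after action


def format_output(strategy: Strategy, action: str, reasoning: str = "") -> str:
    """Order the reasoning and action segments for a given strategy.

    The [THINK] and [ACTION] markers are hypothetical, for illustration only.
    """
    act = f"[ACTION] {action}"
    think = f"[THINK] {reasoning}"
    if strategy is Strategy.NO_THINK:
        return act                 # no reasoning text at all
    if strategy is Strategy.PRE_THINK:
        return f"{think} {act}"    # reasoning precedes the action
    if strategy is Strategy.POST_THINK:
        return f"{act} {think}"    # action is committed first
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    for s in Strategy:
        print(s.name, "->", format_output(s, "turn left", "the hallway is on the left"))
```

Under Pre-Think the action is conditioned on generated reasoning, so an error in the reasoning can carry into the action; under No-Think the action is predicted directly.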

Code & Models

Models

HorizonRobotics/Aux-Think
model · 9 downloads · 1 like

Datasets

HorizonRobotics/Aux-Think
dataset · 43 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Speech and dialogue systems