ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation
Wei Xue, Mingcheng Li, Xuecheng Wu, Jingqun Tang, Dingkang Yang, Lihua Zhang

TL;DR
ProFocus introduces a proactive perception and focused reasoning framework for vision-and-language navigation, leveraging structured semantic maps and selective historical reasoning to improve efficiency and accuracy in complex environments.
Contribution
It presents a training-free, collaborative approach combining LLMs and VLMs for structured perception and targeted reasoning, advancing zero-shot VLN performance.
Findings
Achieves state-of-the-art zero-shot results on R2R and REVERIE benchmarks.
Effectively transforms panoramic views into semantic maps for better perception.
Focuses reasoning on high-value waypoints, reducing computational complexity.
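The last two findings can be illustrated with a minimal sketch. It is not the paper's implementation: the detection tuple format, the 90-degree heading sectors, the waypoint scores, and the names `build_semantic_map` and `top_k_waypoints` are all assumptions made for illustration. The idea is only that objects seen in a panorama can be grouped by ego-centric direction, and that later reasoning can be limited to a few top-scoring waypoints.

```python
from collections import defaultdict

# Hypothetical detection format (an assumption, not from the paper):
# (label, heading in degrees relative to the agent, distance in meters).
def build_semantic_map(detections, sector_deg=90):
    """Group detections into ego-centric heading sectors (sector 0 starts straight ahead)."""
    sectors = defaultdict(list)
    for label, heading_deg, distance_m in detections:
        sector = int((heading_deg % 360) // sector_deg)
        sectors[sector].append((label, distance_m))
    # Within each sector, list the nearest objects first.
    return {s: sorted(objs, key=lambda o: o[1]) for s, objs in sectors.items()}

def top_k_waypoints(scores, k=2):
    """Keep only the k highest-scoring candidate waypoints (scores are assumed given)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

detections = [("sofa", 10, 2.0), ("door", 350, 4.5), ("lamp", 100, 1.0), ("table", 20, 1.5)]
semantic_map = build_semantic_map(detections)
focused = top_k_waypoints({"wp_a": 0.2, "wp_b": 0.9, "wp_c": 0.5}, k=2)
```

In this toy setup, a sector with no entries would be one simple signal that visual information is missing in that direction, and only the waypoints in `focused` would be passed on for further reasoning.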
Abstract
Vision-and-Language Navigation (VLN) requires agents to accurately perceive complex visual environments and reason over navigation instructions and histories. However, existing methods passively process redundant visual inputs and treat all historical contexts indiscriminately, resulting in inefficient perception and unfocused reasoning. To address these challenges, we propose ProFocus, a training-free progressive framework that unifies Proactive Perception and Focused Reasoning through collaboration between large language models (LLMs) and vision-language models (VLMs). For proactive perception, ProFocus transforms panoramic observations into structured ego-centric semantic maps, enabling the orchestration agent to identify missing visual information needed for reliable decision-making, and to generate targeted visual queries with corresponding focus…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
