GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

Fengyi Wu; Yifei Dong; Yilong Dai; Guangyu Chen; Qifeng Wu; Huiting Huang; Hang Wang; Qi Dai; Alexander G. Hauptmann; Zhi-Qi Cheng

arXiv:2508.09547·cs.CV·April 30, 2026

TL;DR

GoViG introduces a new task: generating navigation instructions solely from egocentric visual observations of the initial and goal states. It pairs a multimodal reasoning framework with a new dataset to improve adaptability to unseen environments and the quality of the generated instructions.

Contribution

The paper contributes a new task, a multimodal reasoning framework, and a dataset for goal-conditioned visual navigation instruction generation from raw egocentric visual observations.

Findings

01 Significant improvements in BLEU-4 and CIDEr scores over existing methods.

02 Robust cross-domain generalization demonstrated in experiments.

03 Effective multimodal reasoning strategies enhance instruction quality.
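
For readers unfamiliar with the first metric named above, the following is a minimal sketch of sentence-level BLEU-4 against a single reference: the geometric mean of clipped 1- to 4-gram precisions, scaled by a brevity penalty. It is an illustration only. The function names `ngrams` and `bleu4` are hypothetical, the sketch applies no smoothing, and nothing here is taken from the paper's evaluation code, which may well use a corpus-level or multi-reference variant.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed, single-reference, sentence-level BLEU-4 (illustrative only)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 4)


score = bleu4("turn left at the red door", "turn left at the blue door")
```

An exact match scores 1.0, and a candidate sharing no words with the reference scores 0.0; the example pair above falls strictly between the two.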

Abstract

We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to generate contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike prior work relying on structured inputs, such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, improving adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) navigation visualization, predicting intermediate visual states bridging the initial and goal views; and (2) instruction generation, synthesizing coherent instructions grounded in observed and anticipated visuals. Both subtasks are integrated within an autoregressive multimodal LLM trained with tailored objectives to ensure spatial accuracy and linguistic…
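
The two-subtask decomposition described in the abstract can be sketched as a data flow: predict intermediate views between the initial and goal observations, then generate an instruction grounded in the full sequence. This is a toy sketch under stated assumptions, not the authors' implementation. The abstract says both subtasks live inside one autoregressive multimodal LLM, whereas this sketch separates them into two plain callables for readability. Every name here (`Frame`, `GoalConditionedPipeline`, `linear_visualizer`, `counting_describer`) is hypothetical, and the stand-in components are deliberately trivial.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical stand-in for an egocentric image: a flat list of floats.
Frame = Sequence[float]


@dataclass
class GoalConditionedPipeline:
    """Toy two-stage flow: visualize intermediate states, then describe them."""
    visualize: Callable[[Frame, Frame, int], List[Frame]]  # subtask (1)
    describe: Callable[[List[Frame]], str]                  # subtask (2)
    num_intermediate: int = 3

    def generate(self, initial: Frame, goal: Frame) -> str:
        # (1) Navigation visualization: predict views bridging initial and goal.
        predicted = self.visualize(initial, goal, self.num_intermediate)
        # (2) Instruction generation: ground text in observed + predicted views.
        trajectory = [initial, *predicted, goal]
        return self.describe(trajectory)


def linear_visualizer(initial: Frame, goal: Frame, n: int) -> List[Frame]:
    """Placeholder for a learned model: linear interpolation between frames."""
    frames = []
    for k in range(1, n + 1):
        t = k / (n + 1)
        frames.append([(1 - t) * a + t * b for a, b in zip(initial, goal)])
    return frames


def counting_describer(trajectory: List[Frame]) -> str:
    """Placeholder for a learned model: reports the number of transitions."""
    return f"Follow {len(trajectory) - 1} steps to reach the goal view."


pipeline = GoalConditionedPipeline(linear_visualizer, counting_describer)
instruction = pipeline.generate([0.0, 0.0], [4.0, 8.0])
```

The point of the sketch is only the interface: the instruction generator sees both the observed endpoints and the anticipated intermediate views, which is what the abstract means by instructions "grounded in observed and anticipated visuals".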
