Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu

TL;DR
This paper introduces a dynamic instruction understanding framework for embodied navigation, modeling instructions as evolving states conditioned on the agent's perceptual context to improve navigation performance.
Contribution
The paper proposes S-EGIU, a framework that dynamically updates instruction semantics based on the current visual context, making it more adaptable than methods that encode the instruction once as a static representation.
Findings
Achieves a +2.68% gain in SPL (Success weighted by Path Length) on REVERIE Test Unseen
Demonstrates consistent efficiency improvements across Vision-and-Language Navigation (VLN) benchmarks
Models instruction understanding as a state that evolves step by step
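The summary reports its headline result in SPL without defining it. As a reference point, here is a minimal sketch of the standard SPL definition: each successful episode contributes the ratio of the shortest-path length to the larger of the taken and shortest path lengths, failed episodes contribute zero, and the result is averaged over all episodes. The function name `spl` and the tuple layout are assumptions made for this illustration, not taken from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a (success, shortest_len, taken_len) tuple; this layout
    is a hypothetical convention chosen for the example.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, taken_len in episodes:
        if success:
            # A path shorter than the reference cannot score above 1.
            total += shortest_len / max(taken_len, shortest_len)
    return total / len(episodes)


# Three toy episodes: optimal success, success with a 2x detour, failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_score = spl(toy_episodes)
```

On the toy episodes, the three contributions are 1, 0.5, and 0, so the score is 0.5. Because the metric penalizes detours as well as failures, a gain in SPL reflects more efficient successful trajectories, which is consistent with the efficiency framing in the findings.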
Abstract
Vision-and-Language Navigation (VLN) requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, which limits their ability to adapt its meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step, conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine…
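To make the Instruction-as-State idea concrete, the following is a toy sketch of a token-level instruction state that is re-weighted and updated at every step from the current perceptual vector. It is not the S-EGIU architecture, whose details are cut off in the abstract above. The dot-product relevance score, the interpolation rule, and the names `update_instruction_state`, `rollout`, and `mix` are all assumptions introduced for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_instruction_state(token_states, percept, mix=0.5):
    """One navigation step of a hypothetical state update.

    token_states: list of token vectors (instruction state from the last step)
    percept: vector summarizing the current observation (assumed given)
    mix: hypothetical upper bound on how far a token moves toward the percept
    Returns the new token states and the per-token relevance weights.
    """
    relevance = softmax([dot(tok, percept) for tok in token_states])
    new_states = []
    for tok, weight in zip(token_states, relevance):
        rate = mix * weight  # more relevant tokens are updated more strongly
        new_states.append([(1 - rate) * t + rate * p for t, p in zip(tok, percept)])
    return new_states, relevance


def rollout(token_states, percepts):
    """Carry the instruction state across steps instead of encoding it once."""
    history = []
    for percept in percepts:
        token_states, relevance = update_instruction_state(token_states, percept)
        history.append(relevance)
    return token_states, history


# Two toy instruction tokens and two observations, each aligned with one token.
tokens = [[1.0, 0.0], [0.0, 1.0]]
observations = [[1.0, 0.0], [0.0, 1.0]]
final_states, relevance_history = rollout(tokens, observations)
```

In this toy run the first token receives the higher weight at the first step and the second token at the second step, so the same instruction is read differently as the observation changes. A static encoder would instead compute one representation up front and reuse it at every step.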
