Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

Zhen Liu; Yuhan Liu; Jinjun Wang; Jianyi Liu; Wei Song; Jingwen Fu

arXiv:2604.18223 · cs.CV · April 21, 2026

TL;DR

This paper introduces a dynamic instruction understanding framework for embodied navigation, modeling instructions as evolving states conditioned on the agent's perceptual context to improve navigation performance.

Contribution

The paper proposes S-EGIU (State-Entangled Environment-Guided Instruction Understanding), a framework that dynamically updates instruction semantics based on the current visual context, making it more adaptable than static instruction encoding methods.

Findings

01 Achieves a +2.68% SPL gain on REVERIE Test Unseen

02 Demonstrates consistent efficiency improvements across VLN benchmarks

03 Models instruction understanding as a step-by-step evolving state
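
For readers unfamiliar with the SPL figure quoted above: SPL (Success weighted by Path Length) is a standard navigation metric that averages, over episodes, the success indicator multiplied by the ratio of the shortest path length to the longer of the agent's path and the shortest path. The sketch below computes it on made-up episodes; the function name `spl` and the episode tuples are illustrative assumptions, not code or data from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    Each episode is a tuple (success, shortest_path_length, agent_path_length).
    Assumes shortest_path_length > 0 for every episode.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode is discounted when the agent's path
            # is longer than the shortest path to the goal.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Made-up episodes for illustration only (not results from the paper).
episodes = [
    (True, 10.0, 10.0),   # success along the shortest path: contributes 1.0
    (True, 10.0, 20.0),   # success on a path twice as long: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
score = spl(episodes)
```

Because SPL penalizes detours as well as failures, a gain in SPL reflects more efficient navigation and not only a higher success rate.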

Abstract

Vision-and-Language Navigation requires agents to follow natural-language instructions in visually changing environments. A central challenge is the dynamic entanglement between language and observations: the meaning of an instruction shifts as the agent's field of view and spatial context evolve. However, many existing models encode the instruction as a static global representation, limiting their ability to adapt instruction meaning to the current visual context. We therefore model instruction understanding as an Instruction-as-State variable: a decision-relevant, token-level instruction state that evolves step by step conditioned on the agent's perceptual state, where the perceptual state denotes the observation-grounded navigation context at each step. To realize this principle, we introduce State-Entangled Environment-Guided Instruction Understanding (S-EGIU), a coarse-to-fine…
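
The Instruction-as-State idea in the abstract can be illustrated with a toy sketch: token-level instruction states that are re-estimated at each step from the current perceptual state, in contrast to a single static encoding. The abstract is truncated before the S-EGIU architecture is described, so the update rule here (softmax relevance of each token to the percept, followed by an additive blend) is a stand-in chosen purely for illustration. All names, the `mix` parameter, and the 2-d vectors are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_instruction_state(token_states, percept, mix=0.5):
    """Return (new_token_states, relevance) for one navigation step.

    Hypothetical update rule, not the S-EGIU architecture: each token is
    scored against the current percept, and the percept is blended into
    each token in proportion to that score.
    """
    scale = math.sqrt(len(percept))
    relevance = softmax([dot(tok, percept) / scale for tok in token_states])
    new_states = [
        [t + mix * w * p for t, p in zip(tok, percept)]
        for tok, w in zip(token_states, relevance)
    ]
    return new_states, relevance


# Toy 2-d embeddings for a two-token instruction (made-up values).
initial_state = [[1.0, 0.0], [0.0, 1.0]]
# Toy perceptual states for two consecutive steps (made-up values).
percepts = [[1.0, 0.0], [0.0, 1.0]]

state = initial_state
history = []
for percept in percepts:
    # The instruction state carries over and is re-conditioned each step.
    state, relevance = update_instruction_state(state, percept)
    history.append((state, relevance))
```

In this toy run the first percept aligns with the first token and the second percept with the second, so the relevance weights shift from one token to the other across steps while the token states themselves accumulate context. A static encoder would instead reuse `initial_state` unchanged at every step.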
