Structured Observation Language for Efficient and Generalizable Vision-Language Navigation

Daojie Peng, Fulong Ma, and Jun Ma

arXiv:2603.27577·cs.CV·March 31, 2026

TL;DR

SOL-Nav introduces a structured language approach to vision-language navigation, converting visual observations into compact descriptions to improve efficiency and generalization across environments.

Contribution

The paper presents SOL-Nav, a framework that converts egocentric RGB-D observations into compact structured language descriptions, with the aim of reducing model size and improving generalization in VLN tasks.

Findings

1. Achieves strong generalization to unseen environments.

2. Reduces model size and training data dependency.

3. Performs well on standard VLN benchmarks and real-world deployments.

Abstract

Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, which requires large-scale visual pre-training and generalizes poorly under environmental variations (e.g., lighting, texture). To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as…
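The grid-to-text step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (`describe_observation`, `build_prompt`), the choice of majority label, mean color, and median depth as the "representative" values, the small color palette, and the output text format. It also assumes per-pixel semantic labels are already available and that the image is at least N pixels on each side.

```python
from collections import Counter
from statistics import median

# Hypothetical color vocabulary; the abstract does not specify one.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "gray": (128, 128, 128),
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
}


def nearest_color_name(rgb):
    """Return the palette name closest to rgb in squared Euclidean distance."""
    return min(
        PALETTE,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name])),
    )


def describe_observation(labels, colors, depths, n):
    """Summarize an RGB-D frame as one text line per cell of an n x n grid.

    labels: 2D list of semantic label strings (assumed to be given)
    colors: 2D list of (r, g, b) tuples
    depths: 2D list of depths in meters
    Assumes height >= n and width >= n so that no cell is empty.
    """
    height, width = len(labels), len(labels[0])
    lines = []
    for gr in range(n):
        r0, r1 = gr * height // n, (gr + 1) * height // n
        for gc in range(n):
            c0, c1 = gc * width // n, (gc + 1) * width // n
            pixels = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
            # Representative values (an assumption): majority label,
            # mean color snapped to the palette, and median depth.
            label = Counter(labels[r][c] for r, c in pixels).most_common(1)[0][0]
            mean_rgb = tuple(
                sum(colors[r][c][k] for r, c in pixels) / len(pixels)
                for k in range(3)
            )
            depth = median(depths[r][c] for r, c in pixels)
            lines.append(
                f"cell({gr},{gc}): {label}, {nearest_color_name(mean_rgb)}, {depth:.1f}m"
            )
    return "\n".join(lines)


def build_prompt(instruction, observation_text):
    """Concatenate the instruction with the structured observation text."""
    return f"Instruction: {instruction}\nObservation:\n{observation_text}"
```

With N = 2 on a 4×4 frame, this yields four short lines of text in place of 16 pixels; the same compression idea applies to full-resolution frames, where the text length depends only on N rather than on image size.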
