Vision-Language Navigation with Energy-Based Policy

Rui Liu; Wenguan Wang; Yi Yang

arXiv:2410.14250·cs.CV·October 21, 2024

TL;DR

This paper introduces an energy-based policy for vision-language navigation (VLN) that models the joint state-action distribution, so that the agent matches expert behavior more closely and accumulates fewer errors along a trajectory. It reports promising results on R2R, REVERIE, RxR, and R2R-CE.

Contribution

The paper proposes an Energy-based Navigation Policy (ENP), which uses an energy-based model of the joint state-action distribution to align the agent with expert behavior.

Findings

01 Achieves promising results on R2R, REVERIE, RxR, and R2R-CE datasets.

02 Effectively models the joint distribution of states and actions.

03 Improves upon existing VLN models by reducing error accumulation.

Abstract

Vision-language navigation (VLN) requires an agent to execute actions following human instructions. Existing VLN models are optimized through expert demonstrations by supervised behavioural cloning or incorporating manual reward engineering. While straightforward, these efforts overlook the accumulation of errors in the Markov decision process, and struggle to match the distribution of the expert policy. Going beyond this, we propose an Energy-based Navigation Policy (ENP) to model the joint state-action distribution using an energy-based model. At each step, low energy values correspond to the state-action pairs that the expert is most likely to perform, and vice versa. Theoretically, the optimization objective is equivalent to minimizing the forward divergence between the occupancy measure of the expert and ours. Consequently, ENP learns to globally align with the expert policy by…
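
The abstract's statement that low energy marks the state-action pairs the expert is most likely to perform can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Boltzmann form, p(a | s) proportional to exp(-E(s, a)), over a handful of candidate actions at a single step, and it covers only the conditional action distribution, not the joint state-action distribution that ENP models. The function names and the energy values are hypothetical.

```python
import math


def action_probabilities(energies):
    """Map per-action energies to a Boltzmann distribution: p(a|s) ∝ exp(-E(s, a))."""
    # Shift by the minimum energy for numerical stability; the normalized
    # probabilities are unchanged by this shift.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def expert_nll(energies, expert_index):
    """Negative log-likelihood of the expert's action under the distribution above."""
    probs = action_probabilities(energies)
    return -math.log(probs[expert_index])


# Hypothetical energies for three candidate actions at one navigation step.
energies = [0.2, 1.5, 3.0]
probs = action_probabilities(energies)

# The lowest-energy action receives the highest probability.
best_action = max(range(len(probs)), key=probs.__getitem__)

# If the expert chose action 0, this is the imitation loss for the step.
loss = expert_nll(energies, expert_index=0)
```

In this simplified view, lowering the energy of the expert's action relative to the alternatives reduces the loss, which is the sense in which low energy corresponds to likely expert behavior.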
