Embodied Navigation with Auxiliary Task of Action Description Prediction

Haru Kondoh; Asako Kanezaki

arXiv:2510.21809·cs.CV·October 28, 2025

TL;DR

This paper adds action description prediction as an auxiliary language task in reinforcement learning for indoor robot navigation. The approach improves explainability without sacrificing navigation performance and achieves state-of-the-art results in semantic audio-visual navigation, a multimodal navigation task.

Contribution

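The paper proposes a method that incorporates action description prediction into reinforcement learning as an auxiliary task, using knowledge distillation from pre-trained description generation models in place of ground-truth descriptions. This improves explainability and performance in multimodal navigation.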

Findings

01. Generates effective action descriptions during navigation.

02. Achieves state-of-the-art performance in semantic audio-visual navigation.

03. Maintains high navigation performance while adding explainability.

Abstract

Multimodal robot navigation in indoor environments has attracted significant attention in recent years. However, as tasks and methods become more advanced, action decision systems tend to become more complex and operate as black boxes. For a system to be reliable, the ability to explain or describe its decisions is crucial; yet there tends to be a trade-off, in that explainable systems cannot match the performance of non-explainable ones. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate action description into reinforcement learning because ground-truth data is absent. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language…
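As a rough illustration of the idea described in the abstract, the sketch below shows one way an auxiliary description loss could be added to a reinforcement learning loss when the target descriptions come from a frozen pre-trained model rather than from human annotation. It is not the authors' implementation. The function names, the `aux_weight` value, and the use of token-level cross-entropy against teacher-generated tokens are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def description_distillation_loss(student_logits, teacher_token_ids):
    """Mean cross-entropy between the student's per-step token logits and
    the token ids produced by a frozen teacher description model.

    Assumption: the teacher's output is used as a pseudo-label, since no
    ground-truth action descriptions exist.
    """
    if not teacher_token_ids:
        raise ValueError("teacher description must contain at least one token")
    if len(student_logits) != len(teacher_token_ids):
        raise ValueError("one teacher token is needed per decoding step")
    nll = 0.0
    for step_logits, token_id in zip(student_logits, teacher_token_ids):
        nll -= log_softmax(step_logits)[token_id]
    return nll / len(teacher_token_ids)


def total_loss(rl_loss, student_logits, teacher_token_ids, aux_weight=0.1):
    """Navigation RL loss plus a weighted auxiliary description loss.

    `aux_weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    aux = description_distillation_loss(student_logits, teacher_token_ids)
    return rl_loss + aux_weight * aux


if __name__ == "__main__":
    # Toy example: a 4-token vocabulary and a two-token teacher description.
    logits = [[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]]
    teacher = [0, 2]
    print(total_loss(rl_loss=1.0, student_logits=logits, teacher_token_ids=teacher))
```

In this sketch the navigation policy and the description head would share an encoder, so the gradient of the auxiliary term shapes the same representation that drives action selection; that shared-encoder design is likewise an assumption here.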
