Multimodal Large Language Model for Visual Navigation

Yao-Hung Hubert Tsai; Vansh Dhar; Jialu Li; Bowen Zhang; Jian Zhang

arXiv:2310.08669·cs.CV·November 7, 2023·1 cites

Open Access

TL;DR

This paper fine-tunes a multimodal large language model for visual navigation using a simple prompt; the model outperforms state-of-the-art behavior cloning methods and reduces collision rates.

Contribution

The work fine-tunes a large language model for visual navigation using a simple text prompt, current observations, and a history collector model, rather than relying on extensive prompt engineering.

Findings

01. Outperforms state-of-the-art behavior cloning methods

02. Reduces collision rates in visual navigation tasks

03. Effective use of human demonstrations and collision signals

Abstract

Recent efforts to enable visual navigation using large language models have mainly focused on developing complex prompt systems. These systems incorporate instructions, observations, and history into massive text prompts, which are then combined with pre-trained large language models to facilitate visual navigation. In contrast, our approach aims to fine-tune large language models for visual navigation without extensive prompt engineering. Our design involves a simple text prompt, current observations, and a history collector model that gathers information from previous observations as input. For output, our design provides a probability distribution of possible actions that the agent can take during navigation. We train our model using human demonstrations and collision signals from the Habitat-Matterport 3D Dataset (HM3D). Experimental results demonstrate that our method outperforms…
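
To make the output side of this design concrete, here is a minimal sketch of a policy that emits a probability distribution over actions and is trained on both an expert demonstration and a collision signal. The abstract does not state the actual loss, so everything specific here is an assumption for illustration: the action names, the `navigation_loss` function, the `collision_weight` parameter, and the choice to penalize the probability mass placed on colliding actions.

```python
import math

# Hypothetical discrete action set; the abstract does not list the actions.
ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]


def softmax(logits):
    """Turn raw action scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def navigation_loss(logits, expert_action, collides, collision_weight=1.0):
    """Assumed training objective, not the paper's confirmed loss.

    logits        -- one score per action from the model's output head
    expert_action -- index of the action taken in the human demonstration
    collides      -- one bool per action, True if that action would collide
    """
    probs = softmax(logits)
    # Behavior cloning term: negative log-likelihood of the expert action.
    bc_loss = -math.log(probs[expert_action])
    # Collision term: probability mass assigned to colliding actions.
    collision_mass = sum(p for p, c in zip(probs, collides) if c)
    return bc_loss + collision_weight * collision_mass


logits = [2.0, 0.5, 0.1, -1.0]
probs = softmax(logits)
loss_safe = navigation_loss(logits, 0, [False, False, False, False])
loss_risky = navigation_loss(logits, 0, [False, True, False, False])
```

In this sketch, flagging `turn_left` as a colliding action raises the loss by exactly the probability the model assigns to it, so training would push mass away from that action while still matching the demonstration.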

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Geographic Information Systems Studies