CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

Adarsh Jagan Sathyamoorthy; Kasun Weerakoon; Mohamed Elnoor; Anuj Zore; Brian Ichter; Fei Xia; Jie Tan; Wenhao Yu; Dinesh Manocha

arXiv:2403.15637 · cs.RO · March 26, 2024 · 3 cites

Open Access

TL;DR

This paper introduces CoNVOI, an approach that uses Vision Language Models (VLMs) for autonomous robot navigation in indoor and outdoor environments: the VLM identifies the context of the scene, reasons about a suitable trajectory, and guides motion planning with minimal queries.

Contribution

CoNVOI is the first method to integrate VLMs for context-aware navigation, combining zero-shot scene classification, semantic reasoning, and a novel multi-modal visual marking technique.

Findings

01. Effective context recognition in indoor and outdoor scenes

02. Navigation behaviors resemble human-like decisions

03. Reduced VLM queries through trajectory extrapolation

Abstract

We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image…
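The visual marking idea in the abstract, numbering obstacle-free regions of the camera image by correlating it with a local occupancy map, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a forward-facing pinhole camera whose optical axis is parallel to flat ground, a robot-centered grid where 0 means free, and hypothetical names throughout (`Camera`, `project_ground_point`, `number_free_cells`, and all parameter values).

```python
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera; all fields are illustrative assumptions."""
    fx: float          # focal length in pixels (horizontal)
    fy: float          # focal length in pixels (vertical)
    cx: float          # principal point, x
    cy: float          # principal point, y
    height_m: float    # camera height above the ground plane
    width_px: int      # image width
    height_px: int     # image height


def project_ground_point(cam, forward_m, left_m):
    """Project a flat-ground point to a pixel; return None if not visible."""
    if forward_m <= 0:
        return None
    # Points to the left of the robot appear left of the image center,
    # and the ground lies below the optical axis.
    u = cam.cx - cam.fx * left_m / forward_m
    v = cam.cy + cam.fy * cam.height_m / forward_m
    if 0 <= u < cam.width_px and 0 <= v < cam.height_px:
        return (round(u), round(v))
    return None


def number_free_cells(grid, cell_m, cam, stride=1):
    """Assign labels 1..N to free grid cells that fall inside the image.

    Assumed layout: row 0 is the row nearest the robot, the robot sits just
    behind the middle column, and a cell value of 0 means obstacle-free.
    Returns {label: (u, v, row, col)}.
    """
    n_cols = len(grid[0])
    marks = {}
    label = 1
    for row in range(0, len(grid), stride):
        for col in range(0, n_cols, stride):
            if grid[row][col] != 0:
                continue  # skip occupied cells
            forward_m = (row + 1) * cell_m
            left_m = ((n_cols - 1) / 2 - col) * cell_m
            pixel = project_ground_point(cam, forward_m, left_m)
            if pixel is not None:
                marks[label] = (pixel[0], pixel[1], row, col)
                label += 1
    return marks
```

In a sketch like this, the labels would be drawn onto the RGB image at the returned pixel positions before the image is passed to a VLM. Because each label keeps its grid cell, a number picked from the image can be mapped back to a location in the occupancy map.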

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques