CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments
Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Mohamed Elnoor, Anuj Zore, Brian Ichter, Fei Xia, Jie Tan, Wenhao Yu, Dinesh Manocha

TL;DR
This paper introduces CoNVOI, an approach that uses Vision Language Models (VLMs) for autonomous robot navigation in indoor and outdoor environments: the VLM identifies the context of the scene, reasons about a suitable trajectory, and guides motion planning with minimal queries.
Contribution
CoNVOI is presented as the first method to integrate VLMs for context-aware navigation, combining zero-shot scene classification, semantic reasoning, and a novel multi-modal visual marking technique.
Findings
Effective context recognition in indoor and outdoor scenes
Navigation behaviors resemble human-like decisions
Reduced VLM queries through trajectory extrapolation
Abstract
We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based navigation behaviors as simple text prompts (e.g., "stay on the pavement"). Second, we utilize their state-of-the-art semantic understanding and logical reasoning capabilities to compute a suitable trajectory given the identified context. To this end, we propose a novel multi-modal visual marking approach to annotate the obstacle-free regions in the RGB image used as input to the VLM with numbers, by correlating it with a local occupancy map of the environment. The marked numbers ground image…
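The marking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the grid encoding (0 = free, 1 = occupied), the sampling stride, and the pairing of contexts with prompts are all assumptions made for illustration, and the projection from occupancy-map cells to image pixels is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical context-to-behavior table. Only the prompt text
# "stay on the pavement" comes from the abstract; pairing it with
# "outdoor terrain" and the other entries are assumptions.
BEHAVIOR_PROMPTS: Dict[str, str] = {
    "outdoor terrain": "stay on the pavement",
    "indoor corridor": "keep to one side of the corridor",  # assumed
    "crosswalk": "cross within the marked crosswalk",       # assumed
}


def mark_free_cells(grid: List[List[int]], stride: int = 2) -> Dict[int, Tuple[int, int]]:
    """Number every stride-th obstacle-free cell (0 = free, 1 = occupied).

    Returns a mapping from marker number to (row, col) in the grid.
    """
    markers: Dict[int, Tuple[int, int]] = {}
    next_id = 1
    for r in range(0, len(grid), stride):
        for c in range(0, len(grid[r]), stride):
            if grid[r][c] == 0:
                markers[next_id] = (r, c)
                next_id += 1
    return markers


def build_prompt(context: str, markers: Dict[int, Tuple[int, int]]) -> str:
    """Compose a text query that asks for a path as a sequence of marker numbers."""
    behavior = BEHAVIOR_PROMPTS[context]
    ids = ", ".join(str(i) for i in sorted(markers))
    return (
        f"Context: {context}. Behavior: {behavior}. "
        f"Choose a sequence of numbers from [{ids}] that forms a suitable path."
    )


# Toy 3x3 occupancy grid: two occupied cells, the rest free.
grid = [
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
markers = mark_free_cells(grid, stride=2)
prompt = build_prompt("outdoor terrain", markers)
```

In a sketch like this, the numbers returned by the model would be looked up in `markers` to recover grid locations, which is one way numbered marks could tie a text answer back to positions the planner understands.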
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Advanced Image and Video Retrieval Techniques
