GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System
MoniJesu James, Amir Atef Habel, Aleksey Fedoseev, and Dzmitry Tsetserukou

TL;DR
GoalVLM introduces a multi-agent framework that leverages vision-language models and spatial reasoning for zero-shot, open-vocabulary object navigation, outperforming prior methods without task-specific training.
Contribution
It integrates VLMs with spatial and semantic reasoning for multi-agent object navigation, enabling zero-shot generalization to novel goals.
Findings
Achieves 55.8% subtask success rate on GOAT-Bench
Outperforms state-of-the-art methods without training
Validates importance of VLM-guided reasoning and localization
Abstract
Object-goal navigation has traditionally been limited to ground robots with closed-set object vocabularies. Existing multi-agent approaches depend on precomputed probabilistic graphs tied to fixed category sets, precluding generalization to novel goals at test time. We present GoalVLM, a cooperative multi-agent framework for zero-shot, open-vocabulary object navigation. GoalVLM integrates a Vision-Language Model (VLM) directly into the decision loop, SAM3 for text-prompted detection and segmentation, and SpaceOM for spatial reasoning, enabling agents to interpret free-form language goals and score frontiers via zero-shot semantic priors without retraining. Each agent builds a BEV semantic map from depth-projected voxel splatting, while a Goal Projector back-projects detections through calibrated depth into the map for reliable goal localization. A constraint-guided reasoning layer…
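The goal-localization step mentioned in the abstract (back-projecting a detection through calibrated depth into the BEV map) can be illustrated with a standard pinhole camera model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the planar agent pose (x, y, yaw), the axis conventions, and the grid origin and resolution are all hypothetical choices made for illustration.

```python
import math


def backproject_pixel(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth.

    Returns a point in the camera frame (x right, y down, z forward).
    The intrinsics fx, fy, cx, cy are assumed to come from calibration.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def camera_to_world_2d(point_cam, agent_x, agent_y, agent_yaw):
    """Drop height and rotate the point into a world ground plane.

    Assumed convention (hypothetical): yaw is measured counter-clockwise
    from the world +X axis, and the camera looks along the agent heading.
    """
    x_c, _, z_c = point_cam
    forward = z_c   # distance ahead of the agent
    left = -x_c     # camera x points right, so left is its negation
    wx = agent_x + forward * math.cos(agent_yaw) - left * math.sin(agent_yaw)
    wy = agent_y + forward * math.sin(agent_yaw) + left * math.cos(agent_yaw)
    return (wx, wy)


def world_to_cell(wx, wy, origin_x, origin_y, resolution_m):
    """Convert world coordinates to (row, col) indices of a BEV grid.

    origin_x and origin_y are the world coordinates of the grid corner
    at cell (0, 0); resolution_m is the side length of one cell in metres.
    """
    col = int(math.floor((wx - origin_x) / resolution_m))
    row = int(math.floor((wy - origin_y) / resolution_m))
    return (row, col)


def project_detection(u, v, depth_m, intrinsics, pose, grid):
    """Chain the three steps for a single detection pixel."""
    point_cam = backproject_pixel(u, v, depth_m, *intrinsics)
    wx, wy = camera_to_world_2d(point_cam, *pose)
    return world_to_cell(wx, wy, *grid)
```

For example, `project_detection(320, 240, 2.0, (500.0, 500.0, 320.0, 240.0), (0.0, 0.0, 0.0), (-5.0, -5.0, 0.5))` places a detection seen at the image centre, 2 m ahead of an agent at the origin, into a 0.5 m grid. In practice one would likely aggregate many pixels of a segmentation mask (for instance by taking a median depth) rather than project a single pixel, since raw depth is noisy at object boundaries.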
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
