Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation

Swagat Padhan; Lakshya Jain; Bhavya Minesh Shah; Omkar Patil; Thao Nguyen; Nakul Gopalan

arXiv:2603.19166 · cs.RO · March 20, 2026

Open Access

TL;DR

This paper introduces MAPG, a multi-agent probabilistic framework that improves vision-language grounding for robot navigation by decomposing language into components and probabilistically combining their grounded outputs, especially for metric-semantic tasks.

Contribution

The paper proposes MAPG, an agentic framework for metric-semantic grounding in vision-language navigation. It decomposes a language query into structured subcomponents, grounds each one with a VLM, and probabilistically combines the grounded outputs.

Findings

1. MAPG outperforms strong baselines on the HM-EQA benchmark.
2. The paper introduces MAPG-Bench for evaluating metric-semantic goals.
3. MAPG is demonstrated successfully on a real-world robot.

Abstract

Robots collaborating with humans must convert natural language goals into actionable, physically grounded decisions. For example, executing a command such as "go two meters to the right of the fridge" requires grounding semantic references, spatial relations, and metric constraints within a 3D scene. While recent vision-language models (VLMs) demonstrate strong semantic grounding capabilities, they are not explicitly designed to reason about metric constraints in physically defined spaces. In this work, we empirically demonstrate that state-of-the-art VLM-based grounding approaches struggle with complex metric-semantic language queries. To address this limitation, we propose MAPG (Multi-Agent Probabilistic Grounding), an agentic framework that decomposes language queries into structured subcomponents and queries a VLM to ground each component. MAPG then probabilistically composes these…
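To make the idea of probabilistic composition concrete, here is a minimal sketch for the abstract's example command, "go two meters to the right of the fridge". It is not the paper's implementation: the 2D grid, the Gaussian distance term, the von Mises-style direction term, the parameters `sigma_m` and `kappa`, and all function names are assumptions chosen for illustration. The anchor position and the "right" direction are hard-coded here, standing in for whatever a VLM-based grounder would return.

```python
import math

def distance_likelihood(point, anchor, target_m, sigma_m=0.25):
    """Gaussian score for how well |point - anchor| matches the stated distance."""
    d = math.dist(point, anchor)
    return math.exp(-0.5 * ((d - target_m) / sigma_m) ** 2)

def direction_likelihood(point, anchor, direction, kappa=4.0):
    """Von Mises-style score, highest when (point - anchor) aligns with `direction`.

    `direction` is assumed to be a unit vector in the map frame.
    """
    dx, dy = point[0] - anchor[0], point[1] - anchor[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0  # the anchor itself has no direction
    cos_angle = (dx * direction[0] + dy * direction[1]) / norm
    return math.exp(kappa * (cos_angle - 1.0))

def compose(anchor, direction, target_m, extent_m=4.0, step_m=0.25):
    """Multiply the per-component scores on a grid and normalize to a belief."""
    n = int(round(2 * extent_m / step_m))
    scores = {}
    for i in range(n + 1):
        for j in range(n + 1):
            p = (anchor[0] - extent_m + i * step_m,
                 anchor[1] - extent_m + j * step_m)
            scores[p] = (distance_likelihood(p, anchor, target_m)
                         * direction_likelihood(p, anchor, direction))
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

# Hypothetical grounded components: fridge at the origin, "right" as +x.
fridge_xy = (0.0, 0.0)
right = (1.0, 0.0)
belief = compose(fridge_xy, right, target_m=2.0)
goal = max(belief, key=belief.get)  # most probable navigation goal
```

Treating each component as an independent factor and multiplying them is one simple way to combine semantic, spatial, and metric evidence. Keeping the result as a distribution over positions, rather than a single point, lets a planner fall back to other high-probability cells when the best one is blocked.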

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Robot Manipulation and Learning