Resolving Positional Ambiguity in Dialogues by Vision-Language Models for Robot Navigation

Kuan-Lin Chen, Tzu-Ti Wei, Li-Tzu Yeh, Elaine Kao, Yu-Chee Tseng, and Jen-Jee Chen

arXiv:2410.12802 · cs.RO · October 18, 2024

TL;DR

This paper uses vision-language models to resolve positional ambiguity in natural language commands for indoor robot navigation, so that a command such as "Go to the chair!" can be tied to one specific object even when several similar objects are present.

Contribution

It introduces a two-level method that links language to visual object IDs and depth maps, addressing positional ambiguity in human-robot communication.
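The two-level idea can be illustrated with a minimal sketch. It is not the authors' implementation: the `Detection` class, the function names, and the toy data are hypothetical, the vision-language model and the dialogue are replaced by a hard-coded `user_choice`, and taking the median depth inside a bounding box is an assumption about how an object ID might be tied to a depth map.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence, Tuple


@dataclass
class Detection:
    """Hypothetical record for one object found in the camera image."""
    object_id: int                  # unique ID for this object in the image
    label: str                      # object class, e.g. "chair"
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def resolve_object_id(detections: List[Detection], label: str,
                      user_choice: Optional[int] = None) -> Optional[int]:
    """Level 1 (assumed): map a spoken label to one unique object ID.

    Returns None when the label is ambiguous or absent, which signals
    that another dialogue turn with the user is needed.
    """
    ids = [d.object_id for d in detections if d.label == label]
    if len(ids) == 1:
        return ids[0]
    if user_choice in ids:
        return user_choice
    return None


def object_depth(detections: List[Detection], object_id: int,
                 depth_map: Sequence[Sequence[float]]) -> float:
    """Level 2 (assumed): map an object ID to a depth via its bounding box."""
    det = next(d for d in detections if d.object_id == object_id)
    x0, y0, x1, y1 = det.box
    return median(depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1))


# Toy scene: two chairs and one table, with a 3x4 depth map in meters.
detections = [
    Detection(1, "chair", (0, 0, 2, 2)),
    Detection(2, "chair", (2, 0, 4, 2)),
    Detection(3, "table", (0, 2, 4, 3)),
]
depth_map = [
    [1.0, 1.5, 3.0, 3.5],
    [1.5, 2.0, 3.5, 4.0],
    [2.5, 2.5, 2.5, 2.5],
]

target = resolve_object_id(detections, "chair")  # ambiguous: two chairs
if target is None:
    # In a real system the user would be asked which chair is meant.
    target = resolve_object_id(detections, "chair", user_choice=2)
print(target, object_depth(detections, target, depth_map))
```

In this sketch "Go to the table!" resolves immediately because only one table is detected, while "Go to the chair!" returns `None` until the user picks one of the two chair IDs.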

Findings

01 Effective disambiguation of commands with multiple similar objects

02 Successful mapping from language to visual object IDs and depth maps

03 First integration of foundation models for positional ambiguity resolution

Abstract

We consider an autonomous navigation robot that can accept human commands through natural language to provide services in an indoor environment. These natural language commands may include time, position, object, and action components. However, we observe that the positional components within such commands usually refer to objects in the environment that may contain different levels of positional ambiguity. For example, the command "Go to the chair!" may be ambiguous when there are multiple chairs of the same type in a room. In order to disambiguate these commands, we employ a large language model and a large vision-language model to conduct multiple turns of conversations with the user. We propose a two-level approach that utilizes a vision-language model to map the meanings in natural language to a unique object ID in images and then performs another mapping from the unique object ID…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems