Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

Akira Oyama; Shoichi Hasegawa; Akira Taniguchi; Yoshinobu Hagiwara; Tadahiro Taniguchi

arXiv:2508.16143 · cs.RO · August 25, 2025

TL;DR

This paper introduces MIEL, a multimodal framework enabling robots to resolve ambiguous exophoric instructions by combining sound localization, semantic mapping, visual-language models, and interactive questioning, especially when objects or users are out of view.

Contribution

The paper presents a novel multimodal exophora resolution framework that integrates sound source localization, semantic mapping, visual-language models, and GPT-4o for interactive clarification in real-world scenarios.

Findings

01 Approximately 1.3 times better performance when the user is visible.

02 Approximately 2.0 times better performance when the user is out of view.

03 Effective use of interactive questioning to resolve ambiguities.

Abstract

Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as "Bring me that cup," even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), which is a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query with the user's skeletal data. SSL is utilized to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures…
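
The control flow the abstract describes (turn toward the speaker when they are out of view, rank candidate objects, ask a question when the ranking is ambiguous) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the `Candidate` type, the `turn_angle_to_sound` and `next_action` helpers, the score values, and the `margin` threshold for deciding when to ask are all hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float  # hypothetical likelihood that this object is the referent


def turn_angle_to_sound(robot_yaw: float, sound_bearing: float) -> float:
    """Smallest signed rotation (radians) that points the robot at the sound source."""
    diff = sound_bearing - robot_yaw
    # atan2 of (sin, cos) wraps the difference into (-pi, pi].
    return math.atan2(math.sin(diff), math.cos(diff))


def next_action(user_visible: bool, candidates: list[Candidate], margin: float = 0.2) -> str:
    """Pick the next step; the margin rule is an assumed stand-in for ambiguity detection."""
    if not user_visible:
        # The user is outside the camera view: face them first (via SSL in the paper).
        return "turn_to_user"
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    if not ranked:
        return "ask_question"
    if len(ranked) >= 2 and ranked[0].score - ranked[1].score < margin:
        # The top two candidates are too close to call: ask for clarification.
        return "ask_question"
    return f"fetch:{ranked[0].name}"
```

For example, `next_action(True, [Candidate("cup", 0.5), Candidate("mug", 0.45)])` falls under the assumed margin and yields a clarifying question, while a clearly dominant candidate is fetched directly.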
