IntenBot: Flexible and Imprecise Multimodal Input for LLMs to Understand User Intentions for Casual and Human-Like HRI
Yen-Ting Liu, Chiu-Hsuan Wang, TzuLing Chen, Ting-Ying Lee, Tzu-Hua Wang, Chien-Ming Lin, Bing-Yu Chen, Hsin-Ruey Tsai

TL;DR
IntenBot enables human-like, flexible interaction with robots by understanding imprecise multimodal inputs like voice, gaze, and pointing, using LLMs for disambiguation in XR environments.
Contribution
The paper introduces IntenBot, a novel system that interprets casual, imprecise multimodal user inputs for human-robot interaction, leveraging LLMs for filtering and disambiguation.
Findings
IntenBot effectively filters irrelevant input modalities using LLMs.
A user study reveals natural multimodal interaction behaviors.
IntenBot performs well in XR and real-world robot deployments.
Abstract
In natural human-to-human communication, multimodal user input is typically used to supplement explicit voice commands and complement implicit ones, with casualness allowing for flexible input modality combinations and tolerance for imprecise input data. For example, saying "I want that." with a casual glance at a bottle of water is clear enough in human-to-human communication as an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one. To enable such human-like interaction in human-robot interaction (HRI), we propose a system, IntenBot, to understand user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR. The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation.…
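The abstract's pipeline (voice plus gaze and pointing cues go in, an LLM discards the irrelevant ones, and candidate instructions come out for confirmation) can be illustrated with a minimal sketch. This is not the authors' implementation: the `ModalityObservation` schema, the `build_disambiguation_prompt` function, the `dwell_seconds` field, and the prompt wording are all hypothetical, and the sketch stops at building the prompt text rather than calling any particular LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModalityObservation:
    """One non-verbal cue captured around the utterance (hypothetical schema)."""
    modality: str         # e.g. "gaze" or "finger-pointing"
    target: str           # label of the object the cue landed on
    dwell_seconds: float  # how long the cue rested on that object


def build_disambiguation_prompt(
    utterance: str,
    observations: List[ModalityObservation],
    max_candidates: int = 3,
) -> str:
    """Assemble a text prompt asking an LLM to ignore irrelevant cues and
    propose candidate instructions. The wording is an assumption for
    illustration, not the prompt used in the paper."""
    lines = [
        "A user is talking to a robot. Some cues below may be irrelevant or imprecise.",
        f'Voice command: "{utterance}"',
        "Non-verbal cues:",
    ]
    if observations:
        for obs in observations:
            lines.append(f"- {obs.modality}: {obs.target} ({obs.dwell_seconds:.1f} s)")
    else:
        lines.append("- none")
    lines.append(
        f"List up to {max_candidates} candidate instructions, most likely first, "
        "so the user can confirm one."
    )
    return "\n".join(lines)


# Example from the abstract: an implicit command plus a casual glance.
prompt = build_disambiguation_prompt(
    "I want that.",
    [
        ModalityObservation("gaze", "water bottle", 0.4),
        ModalityObservation("finger-pointing", "table", 0.1),
    ],
)
print(prompt)
```

Passing every cue to the model, rather than pre-filtering with fixed thresholds, mirrors the abstract's idea that the LLM itself decides which modalities matter for a given utterance.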
