Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task
Hassan Ali, Philipp Allgeuer, Stefan Wermter

TL;DR
This paper explores using large language models to predict human intentions in a collaborative object categorization task with a robot, integrating multimodal cues for improved interaction.
Contribution
It introduces a novel multimodal hierarchical approach leveraging LLMs to infer human intentions from verbal and non-verbal cues in human-robot collaboration.
Findings
LLMs can effectively reason about multimodal user cues
The approach improves intention prediction accuracy
Leveraging context and real-world knowledge enhances interaction
Abstract
Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with…
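One way to picture the integration of cues described in the abstract is as a step that flattens perceived verbal cues, non-verbal cues, and environment state into a single text prompt for an LLM. The sketch below is purely illustrative and is not the authors' implementation: the `Observation` class, its field names, the `build_prompt` function, and the prompt wording are all hypothetical, and the hierarchical part of the architecture is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Hypothetical container for one snapshot of perceived cues."""
    speech: str = ""                 # verbal cue (e.g. transcribed speech)
    gesture: str = ""                # non-verbal cue: hand gesture
    pose: str = ""                   # non-verbal cue: body pose
    facial_expression: str = ""      # non-verbal cue: facial expression
    environment: dict = field(default_factory=dict)  # object name -> state


def build_prompt(obs: Observation) -> str:
    """Flatten the available cues into a text prompt (assumed format)."""
    lines = ["Predict the user's intention from the cues below."]
    if obs.speech:
        lines.append(f'User said: "{obs.speech}"')
    if obs.gesture:
        lines.append(f"Gesture: {obs.gesture}")
    if obs.pose:
        lines.append(f"Body pose: {obs.pose}")
    if obs.facial_expression:
        lines.append(f"Facial expression: {obs.facial_expression}")
    # Sort object names so the prompt is deterministic.
    for name, state in sorted(obs.environment.items()):
        lines.append(f"Object {name}: {state}")
    lines.append("Intention:")
    return "\n".join(lines)


obs = Observation(
    speech="put the apple here",
    gesture="pointing at left bowl",
    environment={"apple": "on table"},
)
prompt = build_prompt(obs)
print(prompt)
```

The resulting string would then be passed to whichever LLM is being evaluated; cues that were not observed are simply omitted from the prompt.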
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems
