Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

Hassan Ali; Philipp Allgeuer; Stefan Wermter

arXiv:2404.08424 · cs.RO · April 9, 2025 · 1 citation

Open Access

TL;DR

This paper explores using large language models to predict human intentions in a collaborative object categorization task with a robot, integrating multimodal cues for improved interaction.

Contribution

It introduces a novel multimodal hierarchical approach leveraging LLMs to infer human intentions from verbal and non-verbal cues in human-robot collaboration.

Findings

01 LLMs can effectively reason about multimodal user cues

02 The approach improves intention prediction accuracy

03 Leveraging context and real-world knowledge enhances interaction

Abstract

Human intention-based systems enable robots to perceive and interpret user actions to interact with humans and adapt to their behavior proactively. Therefore, intention prediction is pivotal in creating a natural interaction with social robots in human-designed environments. In this paper, we examine using Large Language Models (LLMs) to infer human intention in a collaborative object categorization task with a physical robot. We propose a novel multimodal approach that integrates user non-verbal cues, like hand gestures, body poses, and facial expressions, with environment states and user verbal cues to predict user intentions in a hierarchical architecture. Our evaluation of five LLMs shows the potential for reasoning about verbal and non-verbal user cues, leveraging their context-understanding and real-world knowledge to support intention prediction while collaborating on a task with…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems