Gesture-Informed Robot Assistance via Foundation Models
Li-Heng Lin, Yuchen Cui, Yilun Hao, Fei Xia, Dorsa Sadigh

TL;DR
GIRAF leverages large language models to interpret human gestures and language instructions, significantly improving robot understanding and collaboration in tabletop tasks through flexible, context-aware reasoning.
Contribution
The paper introduces GIRAF, a novel framework that uses large language models for flexible gesture and instruction interpretation in human-robot interaction.
Findings
70% higher success rate than baseline in gesture interpretation
81% success rate on diverse gesture-based task planning
Effective and user-preferred in human-robot collaboration
Abstract
Gestures serve as a fundamental and significant mode of non-verbal communication among humans. Deictic gestures (such as pointing towards an object), in particular, offer a valuable means of efficiently expressing intent in situations where language is inaccessible, restricted, or highly specialized. As a result, it is essential for robots to comprehend gestures in order to infer human intentions and establish more effective coordination with them. Prior work often relies on a rigid hand-coded library of gestures along with their meanings. However, interpretation of gestures is often context-dependent, requiring more flexibility and common-sense reasoning. In this work, we propose a framework, GIRAF, for more flexibly interpreting gesture and language instructions by leveraging the power of large language models. Our framework is able to accurately infer human intent and contextualize the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Hand Gesture Recognition Systems · Natural Language Processing Techniques
