An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction
Guanting Shen, Zi Tian

TL;DR
This paper introduces a multimodal human-robot interaction framework that combines vision, speech, and language models to improve command understanding and control of a robotic arm, reaching 75% command execution accuracy on consumer-grade hardware.
Contribution
It presents a novel integrated system that combines vision-language models, speech recognition, and fuzzy logic for enhanced HRI, advancing the state of the art in natural human-robot collaboration.
Findings
75% command execution accuracy on consumer hardware
Effective integration of Florence-2, Llama 3.1, and Whisper models
Flexible architecture for future HRI research
Abstract
Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy…
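The flow the abstract describes (speech to text, object detection, language understanding, then control) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Detection`, `Command`, `run_pipeline`, `near`, and `step_size` are hypothetical, the three model stages are passed in as plain callables rather than real Whisper, Florence-2, or Llama 3.1 calls, and using fuzzy membership to scale the arm's step size is only an assumed role for the fuzzy-logic component.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Detection:
    label: str   # object name reported by the detector
    x: float     # object centre in workspace coordinates (units assumed: mm)
    y: float


@dataclass
class Command:
    action: str  # e.g. "pick"
    target: str  # label of the object the user referred to


def near(distance: float, limit: float = 100.0) -> float:
    """Fuzzy membership of 'near': 1 at distance 0, falling linearly to 0 at `limit`."""
    return min(1.0, max(0.0, 1.0 - distance / limit))


def step_size(distance: float, slow: float = 5.0, fast: float = 50.0) -> float:
    """Blend a slow and a fast step, weighted by the 'near' and 'far' memberships."""
    mu_near = near(distance)
    mu_far = 1.0 - mu_near
    return mu_near * slow + mu_far * fast


def run_pipeline(
    audio: Any,
    image: Any,
    transcribe: Callable[[Any], str],                 # stand-in for speech recognition
    detect: Callable[[Any], List[Detection]],         # stand-in for object detection
    parse: Callable[[str, List[str]], Command],       # stand-in for language understanding
) -> Optional[Detection]:
    """Return the detected object that the spoken command refers to, if any."""
    text = transcribe(audio)
    detections = detect(image)
    command = parse(text, [d.label for d in detections])
    for detection in detections:
        if detection.label == command.target:
            return detection
    return None
```

Injecting the stages as callables keeps the sketch independent of any particular model API; in a real system each callable would wrap the corresponding model, and the returned detection would be handed to the arm controller.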
Taxonomy
Topics: Social Robot Interaction and HRI · Hand Gesture Recognition Systems · Multimodal Machine Learning Applications
