An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction

Guanting Shen; Zi Tian

arXiv:2602.20219·cs.RO·February 25, 2026

Open Access

TL;DR

This paper introduces a multimodal human-robot interaction framework that combines vision, speech, and language models to interpret spoken commands and control a robotic arm, reporting 75% command execution accuracy on consumer-grade hardware.

Contribution

The paper presents an integrated system that combines a vision-language model, speech recognition, a large language model, and fuzzy logic so that users can direct a robotic arm through spoken commands.

Findings

01. 75% command execution accuracy on consumer hardware
02. Effective integration of Florence-2, Llama 3.1, and Whisper models
03. Flexible architecture for future HRI research

Abstract

Interpreting human intent accurately is a central challenge in human-robot interaction (HRI) and a key requirement for achieving more natural and intuitive collaboration between humans and machines. This work presents a novel multimodal HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic to enable precise and adaptive control of a Dobot Magician robotic arm. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition, providing users with a seamless and intuitive interface for object manipulation through spoken commands. By jointly addressing scene perception and action planning, the approach enhances the reliability of command interpretation and execution. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy…
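
The data flow described in the abstract (speech to text, text to a structured command, scene perception, then arm motion) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Command`, `Detection`, `run_pipeline`, and every `fake_*` stub are hypothetical, the stages are passed in as plain callables rather than real Whisper, Llama 3.1, Florence-2, or Dobot calls, and the fuzzy-logic stage is left out because the abstract does not say where it sits in the pipeline.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Command:
    """Hypothetical structured command produced by the language model."""
    action: str  # e.g. "pick"
    target: str  # object label the user referred to


@dataclass
class Detection:
    """Hypothetical detection result produced by the vision model."""
    label: str
    center_xy: Tuple[float, float]  # image coordinates of the object


def run_pipeline(
    audio: Any,
    image: Any,
    transcribe: Callable[[Any], str],
    parse_command: Callable[[str], Optional[Command]],
    detect_objects: Callable[[Any], List[Detection]],
    move_arm: Callable[[str, Tuple[float, float]], None],
) -> bool:
    """Run one spoken command end to end; return True if the arm was moved."""
    text = transcribe(audio)              # speech recognition stage
    command = parse_command(text)         # language understanding stage
    if command is None:
        return False                      # command could not be interpreted
    detections = detect_objects(image)    # scene perception stage
    match = next((d for d in detections if d.label == command.target), None)
    if match is None:
        return False                      # requested object not in the scene
    move_arm(command.action, match.center_xy)
    return True


# Stub stages standing in for the real models and robot (assumptions).
def fake_transcribe(audio: Any) -> str:
    return "pick up the red cube"


def fake_parse(text: str) -> Optional[Command]:
    if "pick" in text and "red cube" in text:
        return Command(action="pick", target="red cube")
    return None


def fake_detect(image: Any) -> List[Detection]:
    return [Detection("red cube", (120.0, 80.0)),
            Detection("blue ball", (40.0, 200.0))]


log: List[Tuple[str, Tuple[float, float]]] = []


def fake_move(action: str, xy: Tuple[float, float]) -> None:
    log.append((action, xy))  # record the motion instead of driving hardware


ok = run_pipeline(b"", None, fake_transcribe, fake_parse, fake_detect, fake_move)
```

Passing each stage in as a callable keeps the sketch independent of any particular model API, so a real speech, language, or vision backend could be substituted one at a time.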

Taxonomy

Topics: Social Robot Interaction and HRI · Hand Gesture Recognition Systems · Multimodal Machine Learning Applications