Multimodal Human-Autonomous Agents Interaction Using Pre-Trained Language and Visual Foundation Models

Linus Nwankwo; Elmar Rueckert

arXiv:2403.12273 · cs.RO · December 31, 2024 · 1 citation

TL;DR

This paper presents a multimodal interaction framework enabling natural vocal and textual communication between humans and autonomous robots, leveraging pre-trained language and visual models for understanding and executing commands.

Contribution

It extends existing methods by integrating large language models, visual language models, and speech recognition to improve natural human-robot interaction capabilities.

Findings

01. 87.55% vocal command decoding accuracy

02. 86.27% command execution success rate

03. 0.89 seconds average latency

Abstract

In this paper, we extended the method proposed in [21] to enable humans to interact naturally with autonomous agents through vocal and textual conversations. Our extended method exploits the inherent capabilities of pre-trained large language models (LLMs), multimodal visual language models (VLMs), and speech recognition (SR) models to decode high-level natural language conversations and the semantic understanding of the robot's task environment, and to abstract them into actionable commands or queries for the robot. We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and English language accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved 87.55% vocal command decoding accuracy,…

Taxonomy

Topics: Multimodal Machine Learning Applications