Proactive Human-Robot Interaction using Visuo-Lingual Transformers

Pranay Mathur

arXiv:2310.02506·cs.RO·October 5, 2023

TL;DR

This paper introduces ViLing-MMT, a vision-language transformer model that lets robots interpret visual and lingual cues to collaborate proactively toward a user's goal, with the aim of making human-robot interaction more intuitive.

Contribution

The paper presents a novel multimodal transformer architecture that predicts the user's underlying goal and proactively suggests intermediate tasks, using visual cues from the scene, lingual commands from the user, and knowledge of prior object-object interactions.

Findings

01. Effective in simulation and real-world scenarios
02. Accurately predicts user intentions
03. Proactively suggests relevant tasks

Abstract

Humans possess the innate ability to extract latent visuo-lingual cues to infer context through human interaction. During collaboration, this enables proactive prediction of the underlying intention of a series of tasks. In contrast, robotic agents collaborating with humans naively follow elementary instructions to complete tasks or use specific hand-crafted triggers to initiate proactive collaboration when working towards the completion of a goal. Endowing such robots with the ability to reason about the end goal and proactively suggest intermediate tasks will engender a much more intuitive method for human-robot collaboration. To this end, we propose a learning-based method that uses visual cues from the scene, lingual commands from a user and knowledge of prior object-object interaction to identify and proactively predict the underlying goal the user intends to achieve. Specifically,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation