Proactive Human-Robot Interaction using Visuo-Lingual Transformers
Pranay Mathur

TL;DR
This paper introduces ViLing-MMT, a vision-language transformer that lets robots interpret visual and lingual cues for proactive, goal-oriented human-robot collaboration, making the interaction more intuitive.
Contribution
The paper presents a novel multimodal transformer architecture that predicts user goals and suggests intermediate tasks using visual, lingual, and object interaction data.
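To make the idea of fusing three input streams concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: it assumes each modality has already been embedded into fixed-size token vectors, uses a single self-attention layer with identity projections, and scores goals with a random linear head. All names (`self_attention`, `predict_goal`, `goal_weights`, `DIM`, `NUM_GOALS`) and all values are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity query/key/value projections, so this only
    illustrates how tokens from different modalities attend to each other.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return attended


def predict_goal(visual_tokens, lingual_tokens, interaction_tokens, goal_weights):
    """Fuse the three token streams and return a distribution over goals."""
    # Concatenate along the sequence axis so attention spans all modalities.
    tokens = visual_tokens + lingual_tokens + interaction_tokens
    fused = self_attention(tokens)
    dim = len(fused[0])
    # Mean-pool the fused tokens into one vector.
    pooled = [sum(tok[i] for tok in fused) / len(fused) for i in range(dim)]
    # Hypothetical linear goal head: one weight row per candidate goal.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in goal_weights]
    return softmax(logits)


# Toy usage with random stand-in embeddings (not real features).
rng = random.Random(0)
DIM, NUM_GOALS = 4, 3


def rand_tokens(n):
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


visual = rand_tokens(2)        # e.g. scene features
lingual = rand_tokens(3)       # e.g. command word embeddings
interaction = rand_tokens(1)   # e.g. object-object interaction prior
goal_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_GOALS)]

goal_probs = predict_goal(visual, lingual, interaction, goal_weights)
```

The returned `goal_probs` is a probability distribution over the hypothetical candidate goals; a trained model would replace the random embeddings and weights with learned ones.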
Findings
Effective in simulation and real-world scenarios
Accurately predicts user intentions
Proactively suggests relevant tasks
Abstract
Humans possess the innate ability to extract latent visuo-lingual cues to infer context through human interaction. During collaboration, this enables proactive prediction of the underlying intention of a series of tasks. In contrast, robotic agents collaborating with humans naively follow elementary instructions to complete tasks or use specific hand-crafted triggers to initiate proactive collaboration when working towards the completion of a goal. Endowing such robots with the ability to reason about the end goal and proactively suggest intermediate tasks will engender a much more intuitive method for human-robot collaboration. To this end, we propose a learning-based method that uses visual cues from the scene, lingual commands from a user and knowledge of prior object-object interaction to identify and proactively predict the underlying goal the user intends to achieve. Specifically,…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation
