Robotic Visual Instruction
Yanbang Li, Ziyang Gong, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, Xianzheng Ma

TL;DR
The paper introduces Robotic Visual Instruction (RoVI), a visual, object-centric method for guiding robots using hand-drawn sketches, enabling precise, interpretable, and generalizable task execution without verbal communication.
Contribution
The paper proposes RoVI, which guides robots with spatially precise hand-drawn sketches, and develops VIEW (Visual Instruction Embodied Workflow), a pipeline that uses vision-language models to interpret RoVI and execute complex tasks.
Findings
Achieves an 87.5% success rate on unseen real-world tasks
Encodes spatial-temporal information into human-interpretable visual instructions
Demonstrates strong generalization across diverse tasks
Abstract
Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision introduces challenges for robotic task definition such as ambiguity and verbosity. Moreover, in some public settings where quiet is required, such as libraries or hospitals, verbal communication with robots is inappropriate. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline…
