Robotic Visual Instruction

Yanbang Li; Ziyang Gong; Haoyang Li; Xiaoqi Huang; Haolan Kang; Guangping Bai; Xianzheng Ma

arXiv:2505.00693 · cs.RO · July 29, 2025

TL;DR

The paper introduces Robotic Visual Instruction (RoVI), a visual, object-centric method for guiding robots using hand-drawn sketches, enabling precise, interpretable, and generalizable task execution without verbal communication.

Contribution

It proposes RoVI for spatially precise robot guidance via visual sketches and develops VIEW, a pipeline leveraging vision-language models for interpreting RoVI and executing complex tasks.

Findings

01 Achieves an 87.5% success rate on unseen real-world tasks

02 Encodes spatial-temporal information into visual instructions

03 Generalizes across diverse tasks

Abstract

Recently, natural language has been the primary medium for human-robot interaction. However, its inherent lack of spatial precision introduces challenges for robotic task definition such as ambiguity and verbosity. Moreover, in some public settings where quiet is required, such as libraries or hospitals, verbal communication with robots is inappropriate. To address these limitations, we introduce the Robotic Visual Instruction (RoVI), a novel paradigm to guide robotic tasks through an object-centric, hand-drawn symbolic representation. RoVI effectively encodes spatial-temporal information into human-interpretable visual instructions through 2D sketches, utilizing arrows, circles, colors, and numbers to direct 3D robotic manipulation. To enable robots to understand RoVI better and generate precise actions based on RoVI, we present Visual Instruction Embodied Workflow (VIEW), a pipeline…
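
To make the idea of encoding order and geometry in a sketch more concrete, here is a minimal sketch of how such an instruction might be held in memory. It is not the paper's data format or the VIEW pipeline: the class names `Arrow` and `Circle`, their fields, and the helpers `ordered_steps` and `describe` are all hypothetical, and the only thing taken from the abstract is that the sketches use arrows, circles, colors, and numbers. The assumption here is that the drawn numbers give the execution order.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]  # pixel coordinates in the 2D image (assumed)


@dataclass
class Arrow:
    step: int          # number drawn beside the stroke (assumed to give order)
    color: str         # stroke color
    path: List[Point]  # sampled stroke points, tail first, head last


@dataclass
class Circle:
    step: int
    color: str
    center: Point
    radius: float


Primitive = Union[Arrow, Circle]


def ordered_steps(primitives: List[Primitive]) -> List[Primitive]:
    """Sort primitives by their drawn number (hypothetical temporal order)."""
    return sorted(primitives, key=lambda p: p.step)


def describe(p: Primitive) -> str:
    """Render one primitive as a neutral, human-readable line."""
    if isinstance(p, Arrow):
        return f"step {p.step}: {p.color} arrow from {p.path[0]} to {p.path[-1]}"
    return f"step {p.step}: {p.color} circle at {p.center}, radius {p.radius}"


if __name__ == "__main__":
    sketch = [
        Circle(step=2, color="red", center=(5.0, 5.0), radius=3.0),
        Arrow(step=1, color="green", path=[(0.0, 0.0), (5.0, 5.0)]),
    ]
    for primitive in ordered_steps(sketch):
        print(describe(primitive))
```

The representation deliberately stays in 2D image coordinates and assigns no meaning to colors or shapes; how the marks map to 3D manipulation is described in the paper itself.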

Code & Models

Datasets

yanbang/rovibook
dataset · 93 downloads