CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation

Xiaoqi Li; Lingyun Xu; Mingxu Zhang; Jiaming Liu; Yan Shen; Iaroslav Ponomarenko; Jiahui Xu; Liang Heng; Siyuan Huang; Shanghang Zhang; Hao Dong

arXiv:2505.02166·cs.RO·May 6, 2025

Open Access

TL;DR

CrayonRobo introduces a multi-modal prompt-driven approach for robotic manipulation, enabling explicit task goal communication through visual prompts, which improves robustness and interpretability in long-horizon tasks across simulated and real environments.

Contribution

The paper presents a novel multi-modal prompt system that explicitly encodes task goals and a training strategy for interpreting these prompts in robotic manipulation.

Findings

01 Effective in both simulated and real-world environments

02 Improves robustness on unseen tasks

03 Enables explicit understanding of task objectives

Abstract

In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo, which leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space.…
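
To make the idea of a 2D visual prompt overlaid on an RGB image concrete, the following is a minimal sketch, not the authors' implementation. It marks a contact point with a dot and a post-contact movement direction with a line segment on a key-frame. The function names (`draw_disk`, `draw_segment`, `overlay_prompt`), the colors, the dot radius, and the segment length are all hypothetical choices made for illustration; the abstract does not specify how the prompts are rendered.

```python
import numpy as np


def draw_disk(img, center, radius, color):
    """Paint a filled disk at center=(x, y) into img, in place."""
    h, w = img.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    img[mask] = color


def draw_segment(img, start, end, color):
    """Paint a 1-pixel-wide segment from start to end (both (x, y)), in place."""
    n = int(max(abs(end[0] - start[0]), abs(end[1] - start[1]))) + 1
    xs = np.rint(np.linspace(start[0], end[0], n)).astype(int)
    ys = np.rint(np.linspace(start[1], end[1], n)).astype(int)
    h, w = img.shape[:2]
    keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)  # clip to the image
    img[ys[keep], xs[keep]] = color


def overlay_prompt(rgb, contact_xy, direction_xy, length=20):
    """Return a copy of rgb with a contact dot and a movement-direction segment.

    Assumed (hypothetical) encoding: blue dot = contact point,
    red segment = desired 2D movement direction after contact.
    """
    d = np.asarray(direction_xy, dtype=float)
    norm = np.linalg.norm(d)
    if norm == 0:
        raise ValueError("direction_xy must be non-zero")
    d = d / norm
    tip = (contact_xy[0] + length * d[0], contact_xy[1] + length * d[1])

    out = rgb.copy()
    draw_segment(out, contact_xy, tip, (255, 0, 0))
    draw_disk(out, contact_xy, 3, (0, 0, 255))  # drawn last so the dot stays visible
    return out


# Example: a blank 64x64 key-frame, contact at pixel (x=20, y=30), move right.
image = np.zeros((64, 64, 3), dtype=np.uint8)
prompted = overlay_prompt(image, contact_xy=(20, 30), direction_xy=(1, 0))
```

The sketch covers only the image-space overlay; lifting such 2D prompts to contact poses and movement directions in SE(3) is the job of the trained model described in the abstract and is not reproduced here.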

Taxonomy

Topics: Robot Manipulation and Learning · Robotics and Automated Systems