ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration

Minjie Zhu; Yichen Zhu; Jinming Li; Zhongyi Zhou; Junjie Wen; Xiaoyu Liu; Chaomin Shen; Yaxin Peng; Feifei Feng

arXiv:2502.19250·cs.RO·March 3, 2025

TL;DR

This paper introduces ObjectVLA, a vision-language-action model that enables robots to generalize manipulation skills to new objects without explicit demonstrations, using minimal data and fine-tuning.

Contribution

The paper presents a novel approach for object generalization in robotic manipulation through VLA models, reducing reliance on human demonstrations and enabling zero-shot transfer.

Findings

1. Generalizes to 100 novel objects with a 64% success rate.

2. Needs only minimal data: smartphone images are used for fine-tuning.

3. Demonstrates object-level generalization on real robots.

Abstract

Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. Our model enables robots…
