ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng

TL;DR
This paper introduces ObjectVLA, a vision-language-action model that enables robots to generalize manipulation skills to new objects without explicit demonstrations, using minimal data and fine-tuning.
Contribution
The paper presents a novel approach for object generalization in robotic manipulation through VLA models, reducing reliance on human demonstrations and enabling zero-shot transfer.
Findings
Generalized to 100 novel objects with a 64% success rate
Used minimal data, in the form of smartphone images, for fine-tuning
Demonstrated effective object-level generalization on real robots
Abstract
Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. Our model enables robots…
