KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data

Grace Tang; Swetha Rajkumar; Yifei Zhou; Homer Rich Walke; Sergey Levine; Kuan Fang

arXiv:2409.14066 · cs.RO · September 24, 2024 · 2 cites

Open Access

TL;DR

KALIE fine-tunes pre-trained vision-language models to predict point-based affordances, so robots can perform manipulation tasks from a small set of human-labeled examples, without training data collected on robotic systems.

Contribution

KALIE introduces a scalable method for adapting large pre-trained VLMs to robotic manipulation through point-based affordance prediction and synthetic data generation.

Findings

01. KALIE achieves robust manipulation with only 50 example data points.

02. It outperforms baselines that use pre-trained VLMs.

03. The approach generalizes to unseen objects and tasks.

Abstract

Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate…
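The abstract describes the VLM as predicting point-based affordances on 2D images rather than motor commands. As a rough illustration only, the sketch below shows one way such a prediction could be represented and kept consistent with the image under a simple transformation. Everything here is an assumption for illustration: the JSON reply format, the names `Keypoint`, `parse_affordance`, and `hflip`, and the use of a horizontal flip as a stand-in for data synthesis. It is not the paper's pipeline, whose details are not given in this summary.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Keypoint:
    """A named 2D point in pixel coordinates (hypothetical representation)."""
    name: str
    x: float
    y: float


def parse_affordance(reply: str, width: int, height: int) -> list[Keypoint]:
    """Parse an assumed JSON reply of the form {"name": [u, v], ...},
    with u, v normalized to [0, 1], into pixel-space keypoints."""
    raw = json.loads(reply)
    points = []
    for name, (u, v) in raw.items():
        if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
            raise ValueError(f"keypoint {name!r} lies outside the image")
        points.append(Keypoint(name, u * (width - 1), v * (height - 1)))
    return points


def hflip(points: list[Keypoint], width: int) -> list[Keypoint]:
    """Mirror keypoints horizontally so labels stay aligned with a flipped image.
    A simple stand-in for label-preserving augmentation, not the paper's method."""
    return [Keypoint(p.name, (width - 1) - p.x, p.y) for p in points]


# Example: a made-up reply for a 641x481 image.
reply = '{"grasp": [0.5, 0.25], "target": [1.0, 0.0]}'
keypoints = parse_affordance(reply, width=641, height=481)
flipped = hflip(keypoints, width=641)
```

The point of the sketch is that any synthesized or transformed image must carry its keypoint labels along with it; a downstream controller would then map the predicted points to robot motion, a step this summary does not describe.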

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robot Manipulation and Learning