KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data
Grace Tang, Swetha Rajkumar, Yifei Zhou, Homer Rich Walke, Sergey Levine, Kuan Fang

TL;DR
KALIE fine-tunes pre-trained vision-language models to predict point-based affordances, so robots can perform manipulation tasks from a small set of human-labeled images instead of training data collected on robotic systems.
Contribution
It introduces a scalable method to adapt large pre-trained VLMs for robotic manipulation using affordance prediction and synthetic data generation.
Findings
KALIE achieves robust manipulation with only 50 example data points.
It outperforms baselines that use pre-trained VLMs.
The approach effectively generalizes to unseen objects and tasks.
Abstract
Building generalist robotic systems involves effectively endowing robots with the capabilities to handle novel objects in an open-world setting. Inspired by the advances of large pre-trained models, we propose Keypoint Affordance Learning from Imagined Environments (KALIE), which adapts pre-trained Vision Language Models (VLMs) for robotic control in a scalable manner. Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations based on natural language instructions and visual observations of the scene. The VLM is trained on 2D images with affordances labeled by humans, bypassing the need for training data collected on robotic systems. Through an affordance-aware data synthesis pipeline, KALIE automatically creates massive high-quality training data based on limited example data manually collected by humans. We demonstrate…
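To make the idea of a point-based affordance label concrete, the following is a minimal sketch, not the authors' implementation. It assumes a training example pairs a language instruction with named 2D keypoints in pixel coordinates; the class name `AffordanceExample`, the keypoint names `grasp` and `target`, and the horizontal flip are all hypothetical. The flip is only a stand-in for any transformation of the image and is not the paper's data synthesis pipeline. It illustrates one general requirement: when an example image is altered, its keypoint labels have to be altered consistently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass(frozen=True)
class AffordanceExample:
    """Hypothetical container for one labeled example (not from the paper)."""
    instruction: str              # natural language task description
    image_size: Tuple[int, int]   # (width, height) in pixels
    keypoints: Dict[str, Point]   # keypoint name -> pixel location


def flip_horizontal(example: AffordanceExample) -> AffordanceExample:
    """Mirror the keypoints as if the image were flipped left to right.

    Assumed illustration only: a pixel at column x moves to column
    (width - 1 - x); rows are unchanged.
    """
    width, _ = example.image_size
    flipped = {
        name: (width - 1 - x, y)
        for name, (x, y) in example.keypoints.items()
    }
    return AffordanceExample(example.instruction, example.image_size, flipped)


# Hypothetical example values, chosen only for illustration.
example = AffordanceExample(
    instruction="sweep the trash into the dustpan",
    image_size=(640, 480),
    keypoints={"grasp": (100.0, 200.0), "target": (500.0, 300.0)},
)
mirrored = flip_horizontal(example)
```

Applying `flip_horizontal` twice returns the original keypoints, which is a quick sanity check that the label transform matches the image transform.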
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robot Manipulation and Learning
