SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation
Shengjie Wang, Jiacheng You, Yihang Hu, Jiongye Li, Yang Gao

TL;DR
SKIL is a framework that uses vision foundation models to automatically extract semantic keypoints, letting robots learn complex tasks from fewer demonstrations while remaining robust to object and environmental variations.
Contribution
The paper presents SKIL, a new method that leverages semantic keypoints for data-efficient, generalizable robotic imitation learning, significantly reducing sample complexity and supporting cross-embodiment learning.
Findings
Doubles the performance of baseline methods in real-world tasks such as picking up a cup or mouse.
Achieves 70% success in long-horizon tasks with 30 demonstrations.
Demonstrates robustness to object and environmental variations.
Abstract
Real-world tasks such as garment manipulation and table rearrangement require robots to perform generalizable, highly precise, and long-horizon actions. Although imitation learning has proven to be an effective approach for teaching robots new skills, large amounts of expert demonstration data are still indispensable for these complex tasks, resulting in high sample complexity and costly data collection. To address this, we propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms a descriptor from these keypoints, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods in tasks such as picking up a cup or mouse, while demonstrating exceptional robustness…
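To make the idea of a semantic keypoint descriptor more concrete, here is a minimal sketch of one plausible way such keypoints could be located and summarized. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the matching rule or the descriptor layout. The sketch assumes that a vision foundation model has already produced a grid of per-patch feature vectors (`feature_map`, replaced here by a toy 3x3 grid) and that one reference feature vector per keypoint is available (`reference_features`). The function names `match_keypoints` and `keypoint_descriptor` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def match_keypoints(feature_map, reference_features):
    """For each reference feature, return the (row, col) of the most similar patch.

    feature_map: H x W grid of feature vectors (assumed foundation-model output).
    reference_features: list of K feature vectors, one per semantic keypoint.
    """
    keypoints = []
    for ref in reference_features:
        best_score, best_rc = -2.0, (0, 0)  # cosine similarity is always >= -1
        for r, row in enumerate(feature_map):
            for c, feat in enumerate(row):
                score = cosine(feat, ref)
                if score > best_score:
                    best_score, best_rc = score, (r, c)
        keypoints.append(best_rc)
    return keypoints


def keypoint_descriptor(keypoints, height, width):
    """Flatten keypoint locations into one vector with coordinates scaled to [0, 1]."""
    descriptor = []
    for r, c in keypoints:
        descriptor.append(r / max(height - 1, 1))
        descriptor.append(c / max(width - 1, 1))
    return descriptor


# Toy stand-in for a foundation-model feature map: 3 x 3 patches, 2-D features.
feature_map = [
    [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]],
    [[0.7, 0.3], [0.5, 0.5], [0.3, 0.7]],
    [[0.5, 0.5], [0.1, 0.9], [0.0, 1.0]],
]
reference_features = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical keypoints

keypoints = match_keypoints(feature_map, reference_features)
descriptor = keypoint_descriptor(keypoints, height=3, width=3)
print(keypoints)   # [(0, 0), (2, 2)]
print(descriptor)  # [0.0, 0.0, 1.0, 1.0]
```

In this sketch the low-dimensional `descriptor`, rather than the raw image, would be the input to an imitation policy, which is one intuition for why a keypoint-based representation might need fewer demonstrations.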
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications
