SKIL: Semantic Keypoint Imitation Learning for Generalizable Data-efficient Manipulation

Shengjie Wang; Jiacheng You; Yihang Hu; Jiongye Li; Yang Gao

arXiv:2501.14400 · cs.RO · July 3, 2025

Open Access

TL;DR

SKIL is a framework that uses vision foundation models to automatically extract semantic keypoints, enabling robots to learn complex tasks from fewer demonstrations while remaining robust to object and environmental variations.

Contribution

The paper presents SKIL, a new method that leverages semantic keypoints for data-efficient, generalizable robotic imitation learning, significantly reducing sample complexity and supporting cross-embodiment learning.

Findings

01 Doubles the performance of baseline methods on simple tasks such as picking up a cup or mouse.

02 Achieves 70% success in long-horizon tasks with 30 demonstrations.

03 Demonstrates robustness to object and environmental variations.

Abstract

Real-world tasks such as garment manipulation and table rearrangement require robots to perform generalizable, highly precise, and long-horizon actions. Although imitation learning has proven to be an effective approach for teaching robots new skills, large amounts of expert demonstration data are still indispensable for these complex tasks, resulting in high sample complexity and costly data collection. To address this, we propose Semantic Keypoint Imitation Learning (SKIL), a framework that automatically obtains semantic keypoints with the help of vision foundation models and forms a descriptor of those keypoints, enabling efficient imitation learning of complex robotic tasks with significantly lower sample complexity. In real-world experiments, SKIL doubles the performance of baseline methods in tasks such as picking a cup or mouse, while demonstrating exceptional robustness…
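
The abstract does not spell out how keypoints are located or how the descriptor is built, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a vision foundation model has already produced a dense per-cell feature map, that each semantic keypoint is found by cosine-similarity matching against a reference feature vector, and that the descriptor is simply the flattened list of matched positions and scores. The names match_keypoints, build_descriptor, feature_map, and reference_features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_keypoints(feature_map, reference_features):
    """For each reference feature, return (row, col, similarity) of the
    best-matching cell in the feature map (assumed matching rule)."""
    keypoints = []
    for ref in reference_features:
        best = max(
            ((r, c, cosine(cell, ref))
             for r, row in enumerate(feature_map)
             for c, cell in enumerate(row)),
            key=lambda t: t[2],
        )
        keypoints.append(best)
    return keypoints


def build_descriptor(keypoints):
    """Flatten keypoints into [row, col, sim, row, col, sim, ...],
    a hypothetical compact input for a downstream policy."""
    return [float(v) for kp in keypoints for v in kp]


# Toy 2x2 feature map with 2-D features, standing in for the output
# of a vision foundation model (values are made up for illustration).
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
# Hypothetical reference features, one per semantic keypoint.
reference_features = [[0.0, 2.0], [3.0, 3.0]]

keypoints = match_keypoints(feature_map, reference_features)
descriptor = build_descriptor(keypoints)
print(keypoints)
print(descriptor)
```

In this toy setup the policy input has three numbers per keypoint instead of a full image, which illustrates why a keypoint-based representation could plausibly lower the number of demonstrations needed.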

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications