K-VIL: Keypoints-based Visual Imitation Learning
Jianfeng Gao, Zhi Tao, Noémie Jaquier, and Tamim Asfour

TL;DR
K-VIL is a method for robotic visual imitation that automatically extracts object-centric keypoints and geometric constraints from a small number of human demonstration videos, enabling robust skill transfer in complex, real-world scenes.
Contribution
The paper presents a novel keypoint-based approach for visual imitation learning that works from a single demonstration and incrementally updates task representations.
Findings
Remains effective in cluttered scenes and under viewpoint mismatches.
Handles large object variations and new object instances.
Works in both one-shot and few-shot learning scenarios.
Abstract
Visual imitation learning provides efficient and intuitive solutions for robotic systems to acquire novel manipulation skills. However, simultaneously learning geometric task constraints and control policies from visual inputs alone remains a challenging problem. In this paper, we propose an approach for keypoint-based visual imitation (K-VIL) that automatically extracts sparse, object-centric, and embodiment-independent task representations from a small number of human demonstration videos. The task representation is composed of keypoint-based geometric constraints on principal manifolds, their associated local frames, and the movement primitives that are then needed for the task execution. Our approach is capable of extracting such task representations from a single demonstration video, and of incrementally updating them when new demonstrations become available. To reproduce…
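To make the idea of keypoint-based geometric constraints more concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that one keypoint's final position has been recorded in a local frame across several demonstrations, and it uses plain linear PCA as a simplified stand-in for the principal manifolds mentioned in the abstract. The function name `classify_constraint`, the parameter `var_threshold`, and the constraint labels are hypothetical and chosen only for illustration.

```python
import numpy as np


def classify_constraint(points, var_threshold=1e-4):
    """Guess a constraint type from how one keypoint varies across demos.

    points: (N, 3) array, the keypoint's final position in a local frame
            for each of N demonstrations (assumed input format).
    var_threshold: hypothetical variance cutoff separating "free" from
            "constrained" directions.
    Returns (label, free_directions), where free_directions has one row
    per principal direction along which the keypoint still varies.
    """
    points = np.asarray(points, dtype=float)
    centered = points - points.mean(axis=0)
    # Singular values measure the spread along each principal direction.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / max(len(points) - 1, 1)
    # Number of directions with non-negligible variation.
    dim = int(np.sum(variances > var_threshold))
    labels = {
        0: "point-to-point",  # keypoint always ends at the same place
        1: "point-to-line",   # keypoint ends somewhere along a line
        2: "point-to-plane",  # keypoint ends somewhere on a plane
        3: "unconstrained",   # no linear constraint detected
    }
    return labels[dim], vt[:dim]


# Usage: a keypoint that ends at different positions along the x-axis.
line_demos = [[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]]
label, directions = classify_constraint(line_demos)
print(label, directions)
```

With a single demonstration the variance is zero in every direction, so this sketch falls back to a point-to-point constraint; additional demonstrations can then relax it, which loosely mirrors the incremental updating described in the abstract.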
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
