KITE: Keypoint-Conditioned Policies for Semantic Manipulation
Priya Sundaresan, Suneel Belkhale, Dorsa Sadigh, Jeannette Bohg

TL;DR
KITE introduces a two-step framework using keypoints and instructions for precise semantic manipulation in robots, enabling accurate interpretation and execution of language commands across various real-world tasks.
Contribution
The paper presents a novel keypoint-conditioned approach that improves semantic manipulation and generalization in instruction-following robots, outperforming existing methods.
Findings
Achieves over 70% success rate in real-world tasks
Outperforms baselines that rely on pre-trained visual language models for grounding or on end-to-end visuomotor control
Effective in diverse applications like grasping and coffee-making
Abstract
While natural language offers a convenient shared interface for humans and robots, enabling robots to interpret and follow language commands remains a longstanding challenge in manipulation. A crucial step to realizing a performant instruction-following robot is achieving semantic manipulation, where a robot interprets language at different specificities, from high-level instructions like "Pick up the stuffed animal" to more detailed inputs like "Grab the left ear of the elephant." To tackle this, we propose Keypoints + Instructions to Execution (KITE), a two-step framework for semantic manipulation which attends to both scene semantics (distinguishing between different objects in a visual scene) and object semantics (precisely localizing different parts within an object instance). KITE first grounds an input instruction in a visual scene through 2D image keypoints, providing a highly…
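The two-step structure described in the abstract (ground the instruction to a 2D keypoint, then act conditioned on that keypoint) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the heatmap input stands in for the output of a learned grounding model, the skill names are hypothetical, and choosing a skill from the first word of the instruction is a toy substitute for whatever the authors actually use.

```python
from typing import Callable, Dict, List, Tuple

Keypoint = Tuple[int, int]  # (row, col) pixel coordinates in the image


def ground_keypoint(heatmap: List[List[float]]) -> Keypoint:
    """Step 1 (hypothetical): return the pixel with the highest score.

    The heatmap is assumed to come from a grounding model that scores
    each pixel given the image and the instruction.
    """
    best, best_score = (0, 0), heatmap[0][0]
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best_score:
                best, best_score = (r, c), score
    return best


# Step 2 (hypothetical): a small library of keypoint-conditioned skills.
# Real skills would produce robot motions; these only describe the action.
SKILLS: Dict[str, Callable[[Keypoint], str]] = {
    "pick": lambda kp: f"pick at {kp}",
    "grab": lambda kp: f"grab at {kp}",
}


def execute(instruction: str, heatmap: List[List[float]]) -> str:
    """Ground the instruction to a keypoint, then run the matching skill."""
    keypoint = ground_keypoint(heatmap)
    verb = instruction.split()[0].lower()  # toy skill selection by first word
    skill = SKILLS.get(verb)
    if skill is None:
        raise ValueError(f"no skill for instruction: {instruction!r}")
    return skill(keypoint)


result = execute("Pick up the stuffed animal", [[0.1, 0.2], [0.9, 0.3]])
print(result)
```

The point of the split is that the keypoint is the only interface between the two steps: the grounding step can change without touching the skills, and each skill only needs a pixel location to act on.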
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: OPT
