CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox

TL;DR
CLIPort is a robotic manipulation framework that combines semantic understanding from CLIP with spatial reasoning to perform diverse language-guided tasks efficiently in both simulated and real environments.
Contribution
CLIPort introduces a two-stream architecture that integrates a semantic pathway and a spatial pathway, enabling generalizable, language-conditioned manipulation without explicit object-pose or symbolic representations.
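To make the two-stream idea concrete, here is a toy sketch in plain Python. It is not the authors' implementation: the function names, the hand-set per-pixel features, the two-dimensional "language embedding", and the element-wise sum used for fusion are all assumptions chosen for illustration. A semantic pathway scores each pixel against the instruction, a spatial pathway scores each pixel from geometry alone, and the fused map is read out as a pick location without any explicit object pose.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]               # H x W map of scalars
FeatureMap = List[List[List[float]]]   # H x W x C per-pixel features


def semantic_stream(features: FeatureMap, lang: List[float]) -> Grid:
    """Hypothetical semantic pathway: dot product of pixel features with the instruction embedding."""
    return [[sum(f * l for f, l in zip(px, lang)) for px in row] for row in features]


def spatial_stream(height: Grid) -> Grid:
    """Hypothetical spatial pathway: raised pixels (objects on a table) score higher."""
    return [[h for h in row] for row in height]


def fuse(semantic: Grid, spatial: Grid) -> Grid:
    """Assumed fusion rule for this sketch: element-wise sum of the two maps."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(semantic, spatial)]


def softmax2d(scores: Grid) -> Grid:
    """Normalize a score map into a per-pixel distribution."""
    m = max(s for row in scores for s in row)
    exps = [[math.exp(s - m) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]


def pick_location(affordance: Grid) -> Tuple[int, int]:
    """Return the (row, col) of the highest-scoring pixel."""
    cells = [(r, c) for r, row in enumerate(affordance) for c in range(len(row))]
    return max(cells, key=lambda rc: affordance[rc[0]][rc[1]])


# Toy 3x3 scene. Channel 0 stands for "red", channel 1 for "blue" (made-up features).
empty = [0.0, 0.0]
features = [
    [[1.0, 0.0], empty, empty],   # red block at (0, 0)
    [empty, empty, empty],
    [empty, [0.0, 1.0], empty],   # blue block at (2, 1)
]
height = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
blue_instruction = [0.0, 1.0]     # stand-in embedding for "pick the blue block"

affordance = softmax2d(fuse(semantic_stream(features, blue_instruction), spatial_stream(height)))
print(pick_location(affordance))  # (2, 1)
```

In this sketch the spatial map alone cannot distinguish the two blocks, since both are equally raised, and the semantic map alone carries no geometry. Only the fused map singles out the instructed object, which is the intuition behind pairing the two pathways.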
Findings
Effective in few-shot learning scenarios
Generalizes to unseen semantic concepts
Single multi-task policy performs comparably to multiple single-task policies
Abstract
How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data; however, these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: CLIPort · Contrastive Language-Image Pre-training
