CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Gi-Cheon Kang, Junghyun Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang

TL;DR
This paper introduces CLIP-RT, a vision-language model that learns robotic manipulation policies from natural language supervision, enabling non-experts to teach robots skills efficiently and outperforming existing models in success rates.
Contribution
We propose a novel framework for collecting natural language-supervised robotic data and introduce CLIP-RT, a model that learns language-conditioned visuomotor policies from this data.
Findings
CLIP-RT outperforms OpenVLA by 24% in average success rate.
CLIP-RT achieves a 93.1% average success rate on the LIBERO benchmark.
CLIP-RT runs inference at a throughput of 163 Hz.
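To put the 163 Hz figure in perspective, the short sketch below converts a throughput in Hz into the time available per predicted action. It assumes the reported throughput corresponds to one action prediction per inference call with no batching, which this summary does not confirm; the function name `control_period_ms` is illustrative only.

```python
def control_period_ms(throughput_hz: float) -> float:
    """Convert an inference throughput in Hz to milliseconds per inference.

    Assumption: one inference call yields one action prediction (no batching).
    """
    if throughput_hz <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / throughput_hz


if __name__ == "__main__":
    period = control_period_ms(163.0)
    print(f"{period:.2f} ms per inference")  # prints "6.13 ms per inference"
```

Under that assumption, each inference takes roughly 6 ms.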
Abstract
Teaching robots desired skills in real-world environments remains challenging, especially for non-experts. A key bottleneck is that collecting robotic data often requires expertise or specialized hardware, limiting accessibility and scalability. We posit that natural language offers an intuitive and accessible interface for robot learning. To this end, we study two aspects: (1) enabling non-experts to collect robotic data through natural language supervision (e.g., "move the arm to the right") and (2) training robot policies directly from this supervision. Specifically, we introduce a data collection framework that collects robot demonstrations based on natural language supervision and further augments these demonstrations. We then present CLIP-RT, a new vision-language-action (VLA) model that learns language-conditioned visuomotor policies from this supervision. CLIP-RT adapts the…
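As a rough illustration of what natural language supervision could look like as data, the sketch below maps short commands such as "move the arm to the right" to small end-effector displacements. Everything in it is an assumption made for illustration: the command vocabulary, the 1 cm step size, the axis convention, and the names `LANGUAGE_TO_DELTA` and `label_to_delta` are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

Delta = Tuple[float, float, float]  # (dx, dy, dz) in metres, hypothetical convention

STEP_M = 0.01  # assumed step size of 1 cm per command

# Hypothetical vocabulary of language supervision -> end-effector displacement.
# The axis assignment (right = +y, up = +z) is an arbitrary illustrative choice.
LANGUAGE_TO_DELTA: Dict[str, Delta] = {
    "move the arm to the right": (0.0, STEP_M, 0.0),
    "move the arm forward": (STEP_M, 0.0, 0.0),
    "move the arm up": (0.0, 0.0, STEP_M),
}


def label_to_delta(command: str) -> Delta:
    """Look up the displacement for a language command (case/whitespace-insensitive)."""
    key = " ".join(command.lower().split())
    if key not in LANGUAGE_TO_DELTA:
        raise KeyError(f"unknown language supervision: {command!r}")
    return LANGUAGE_TO_DELTA[key]


if __name__ == "__main__":
    print(label_to_delta("Move the arm to the right"))
```

A fixed lookup table is used here only to keep the example small; it says nothing about how CLIP-RT itself maps language to actions.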
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
