TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomás Lozano-Pérez

TL;DR
TiPToP is a modular, open-vocabulary robotic manipulation system that integrates pretrained vision models with planning, achieving competitive performance without requiring robot-specific training data.
Contribution
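TiPToP combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks. It can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort.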
Findings
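Matches or outperforms a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations, while requiring zero robot data
Analyzes failure modes at the component level across 173 trials to identify directions for improvement
Evaluated on 28 tabletop manipulation tasks in simulation and the real world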
Abstract
We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP, which requires zero robot data, over 28 tabletop manipulation tasks in simulation and the real world and find that it matches or outperforms a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for…
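The component-level failure analysis that a modular design makes possible can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the stage names (`perception`, `planning`, `execution`), the `StageError` exception, and the toy trial data are hypothetical and do not come from the TiPToP codebase. The sketch only shows the general idea of recording which stage failed first and tallying those failures.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


class StageError(Exception):
    """Raised by a stage that cannot produce an output (hypothetical)."""


@dataclass
class TrialResult:
    success: bool
    failed_stage: Optional[str] = None  # name of the first stage that failed


def run_trial(stages, observation):
    """Run the stages in order, feeding each stage's output to the next."""
    data = observation
    for name, stage in stages:
        try:
            data = stage(data)
        except StageError:
            return TrialResult(success=False, failed_stage=name)
    return TrialResult(success=True)


def failure_breakdown(results):
    """Count failed trials by the stage that failed first."""
    return Counter(r.failed_stage for r in results if not r.success)


# Toy stand-ins for real components (illustrative only).
def perceive(obs):
    if not obs["objects"]:
        raise StageError("no objects detected")
    return obs


def plan(obs):
    if obs["goal"] not in obs["objects"]:
        raise StageError("goal object not in scene")
    return ["pick " + obs["goal"]]


def execute(plan_steps):
    return plan_steps


STAGES = [("perception", perceive), ("planning", plan), ("execution", execute)]

trials = [
    {"objects": ["cup"], "goal": "cup"},
    {"objects": [], "goal": "cup"},
    {"objects": ["bowl"], "goal": "cup"},
]
results = [run_trial(STAGES, t) for t in trials]
breakdown = failure_breakdown(results)
print(breakdown)
```

Because each stage has a named boundary, a failed trial can be attributed to a single component, which is harder to do for an end-to-end policy.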
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Modular Robots and Swarm Intelligence
