Wonderful Team: Zero-Shot Physical Task Planning with Visual LLMs
Zidan Wang, Rui Shen, Bradly Stadie

TL;DR
Wonderful Team leverages Vision Large Language Models for zero-shot high-level robotic planning directly from environment images, outperforming previous methods by integrating perception, control, and planning.
Contribution
The paper introduces a novel multi-agent VLLM framework for zero-shot robotic planning that eliminates the need for separate vision systems, enabling more integrated and effective high-level task execution.
Findings
Average 40% success rate improvement on VimaBench over prior methods such as NLaP
30% improvement over Trajectory Generators on drawing and wiping tasks
70% improvement on semantic reasoning tasks with linguistic constraints
Abstract
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework for executing high-level robotic planning in a zero-shot regime. In our context, zero-shot high-level planning means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description, and the VLLM outputs the sequence of actions necessary for the robot to complete the task. Unlike previous methods for high-level visual planning for robotic manipulation, our method uses VLLMs for the entire planning process, enabling a more tightly integrated loop between perception, control, and planning. As a result, Wonderful Team's performance on real-world semantic and physical planning tasks often exceeds that of methods that rely on separate vision systems. For example, we see an average 40% success rate improvement on VimaBench over prior methods such as NLaP, an average…
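The input/output contract the abstract describes (an image and a task description go in, a sequence of actions comes out) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the `Action` schema, the `plan_zero_shot` function, the prompt wording, and the `fake_vllm` stand-in are all hypothetical, and the sketch uses a single model call rather than the multi-agent setup the paper proposes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Action:
    """One high-level step (hypothetical schema, not from the paper)."""
    name: str
    target: str


# Hypothetical stand-in for a VLLM call: (image bytes, prompt) -> raw text reply.
VLLM = Callable[[bytes, str], str]


def plan_zero_shot(image: bytes, task: str, vllm: VLLM) -> List[Action]:
    """Ask the model for a plan and parse one '<name> <target>' pair per line."""
    prompt = f"Task: {task}\nList one action per line as '<name> <target>'."
    reply = vllm(image, prompt)
    actions = []
    for line in reply.splitlines():
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:  # skip blank or malformed lines
            actions.append(Action(parts[0], parts[1]))
    return actions


def fake_vllm(image: bytes, prompt: str) -> str:
    """Canned reply so the sketch runs without any model or network access."""
    return "pick red block\nplace blue bowl\n"


plan = plan_zero_shot(b"", "Put the red block in the blue bowl.", fake_vllm)
for step in plan:
    print(step.name, "->", step.target)
```

In a real system the `vllm` argument would wrap an actual vision-language model, and the parsed actions would be handed to a low-level controller; both pieces are outside the scope of this sketch.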
Taxonomy
Topics: Robotics and Automated Systems
Methods: Sparse Evolutionary Training
