Maestro: Orchestrating Robotics Modules with Vision-Language Models for Zero-Shot Generalist Robots

Junyao Shi; Rujia Yang; Kaitian Chao; Selina Bingqing Wan; Yifei Shao; Jiahui Lei; Jianing Qian; Long Le; Pratik Chaudhari; Kostas Daniilidis; Chuan Wen; Dinesh Jayaraman

arXiv:2511.00917 · cs.RO · November 20, 2025

TL;DR

Maestro uses a vision-language model (VLM) coding agent to compose perception, planning, and control modules into a programmatic policy for each task, outperforming today's VLA models in zero-shot manipulation and extending readily to new modules and robot embodiments.

Contribution

The paper introduces a modular framework that pairs a VLM with a curated set of robot-specific perception, planning, and control modules, enabling flexible zero-shot generalist robot behavior without training on large end-to-end robotics datasets.

Findings

01. Outperforms existing VLA models in zero-shot manipulation tasks.

02. Easily adaptable to new robot embodiments and modules.

03. Requires minimal real-world data for adaptation.

Abstract

Today's best-explored routes towards generalist robots center on collecting ever larger "observations-in actions-out" robotics datasets to train large end-to-end models, copying a recipe that has worked for vision-language models (VLMs). We pursue a road less traveled: building generalist policies directly around VLMs by augmenting their general capabilities with specific robot capabilities encapsulated in a carefully curated set of perception, planning, and control modules. In Maestro, a VLM coding agent dynamically composes these modules into a programmatic policy for the current task and scenario. Maestro's architecture benefits from a streamlined closed-loop interface without many manually imposed structural constraints, and a comprehensive and diverse tool repertoire. As a result, it largely surpasses today's VLA models for zero-shot performance on challenging manipulation skills.…
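
To make the idea of a "programmatic policy" composed from modules more concrete, here is a minimal sketch in Python. It is not Maestro's code or API: the toy 1-D world, the module names (`detect_target`, `plan_step`, `move_by`), and the hand-written `generated_policy` are all hypothetical stand-ins, and no VLM is called. The function `generated_policy` only illustrates the kind of closed-loop program that a coding agent might write on top of a registry of perception, planning, and control modules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyWorld:
    """Hypothetical 1-D scene: a gripper position and a target position."""
    gripper: float = 0.0
    target: float = 3.0
    log: List[str] = field(default_factory=list)


# --- Hypothetical modules (assumptions, not the paper's actual tools) ---

def detect_target(world: ToyWorld) -> float:
    """Perception module: report where the target is."""
    return world.target


def plan_step(current: float, goal: float, max_step: float = 1.0) -> float:
    """Planning module: next displacement toward the goal, clipped to max_step."""
    delta = goal - current
    return max(-max_step, min(max_step, delta))


def move_by(world: ToyWorld, delta: float) -> None:
    """Control module: apply a displacement to the gripper and record it."""
    world.gripper += delta
    world.log.append(f"move_by({delta:+.2f})")


# Registry the agent-written program can draw on: module name -> callable.
MODULES: Dict[str, Callable] = {
    "detect_target": detect_target,
    "plan_step": plan_step,
    "move_by": move_by,
}


def generated_policy(world: ToyWorld, modules: Dict[str, Callable],
                     max_iters: int = 10) -> bool:
    """Stand-in for a program a VLM coding agent might emit for one task.

    Closed loop: re-perceive on every iteration, then plan and act,
    until the gripper reaches the target or the iteration budget runs out.
    """
    for _ in range(max_iters):
        goal = modules["detect_target"](world)
        if abs(goal - world.gripper) < 1e-6:
            return True  # task complete
        step = modules["plan_step"](world.gripper, goal)
        modules["move_by"](world, step)
    return False


if __name__ == "__main__":
    world = ToyWorld()
    print(generated_policy(world, MODULES), world.gripper, world.log)
```

Keeping the modules behind a plain name-to-callable registry is one simple way to get the extensibility the abstract describes: under this sketch's assumptions, adding a capability means registering another function, and the composed program re-perceives each iteration rather than executing a fixed open-loop plan.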

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics