TL;DR
Maestro is a reinforcement learning framework that dynamically orchestrates ensembles of models and skills for multimodal tasks, outperforming large monolithic models while keeping latency low.
Contribution
It introduces a hierarchical, RL-driven orchestration method that effectively combines multiple models and skills without retraining, enhancing multimodal task performance.
Findings
Maestro surpasses GPT-5 and Gemini-2.5-Pro in accuracy on multimodal benchmarks.
The learned policy generalizes to unseen models and skills without retraining.
Maestro maintains high efficiency with low latency.
Abstract
The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically…
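To make the idea of "a lightweight policy over a hierarchical model-skill registry" concrete, here is a minimal toy sketch. It is not the authors' algorithm: the registry entries, the `HierarchicalPolicy` class, the `toy_reward` function, and the single-step REINFORCE update are all hypothetical choices made for illustration. It assumes each task is one decision (pick a model, then a skill under that model), whereas the paper describes a sequential process.

```python
import math
import random

# Hypothetical two-level registry: model name -> skills it can be paired with.
REGISTRY = {
    "vision_model": ["ocr", "chart_parsing"],
    "text_model": ["summarize", "calculator"],
}


def softmax(prefs):
    """Numerically stable softmax over a list of preference scores."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class HierarchicalPolicy:
    """Tabular softmax policy: choose a model first, then a skill under it."""

    def __init__(self, registry, lr=0.5, seed=0):
        self.registry = registry
        self.lr = lr
        self.rng = random.Random(seed)
        self.models = list(registry)
        self.model_prefs = [0.0] * len(self.models)
        self.skill_prefs = {m: [0.0] * len(s) for m, s in registry.items()}

    def _sample(self, prefs):
        probs = softmax(prefs)
        idx = self.rng.choices(range(len(prefs)), weights=probs)[0]
        return idx, probs

    def act(self):
        """Sample (model index, probs) and (skill index, probs)."""
        mi, mp = self._sample(self.model_prefs)
        si, sp = self._sample(self.skill_prefs[self.models[mi]])
        return (mi, mp), (si, sp)

    def update(self, choice, reward):
        """REINFORCE step; grad of log-softmax is (indicator - probability)."""
        (mi, mp), (si, sp) = choice
        for i in range(len(self.model_prefs)):
            self.model_prefs[i] += self.lr * reward * ((i == mi) - mp[i])
        prefs = self.skill_prefs[self.models[mi]]
        for j in range(len(prefs)):
            prefs[j] += self.lr * reward * ((j == si) - sp[j])

    def greedy(self):
        """Return the highest-preference (model, skill) pair."""
        mi = max(range(len(self.models)), key=lambda i: self.model_prefs[i])
        model = self.models[mi]
        prefs = self.skill_prefs[model]
        si = max(range(len(prefs)), key=lambda j: prefs[j])
        return model, self.registry[model][si]


def toy_reward(model, skill):
    # Made-up task: only one model-skill pair solves it.
    return 1.0 if (model, skill) == ("vision_model", "chart_parsing") else 0.0


def train(steps=500, seed=0):
    policy = HierarchicalPolicy(REGISTRY, seed=seed)
    for _ in range(steps):
        choice = policy.act()
        (mi, _), (si, _) = choice
        model = policy.models[mi]
        policy.update(choice, toy_reward(model, REGISTRY[model][si]))
    return policy
```

Calling `train().greedy()` returns the pair the toy reward favors. The underlying "models" are never modified here; only the small preference tables are learned, which is the sense in which such a policy can be lightweight.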
