ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov

TL;DR
ToolOrchestra introduces a small, reinforcement learning-based orchestrator that manages multiple tools to enhance the accuracy and efficiency of large language models on complex tasks, outperforming larger models like GPT-5.
Contribution
The paper presents a novel training method for small orchestrators that coordinate tools, achieving superior performance and efficiency compared to existing tool-use agents and large models.
Findings
Orchestrator achieves higher accuracy at lower cost than previous agents.
On HLE, Orchestrator outperforms GPT-5 with a 2.5x efficiency gain.
Generalizes robustly to unseen tools, maintaining performance.
Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5…
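The abstract names three reward signals (outcome, efficiency, and user preference) without giving a formula in this excerpt. The sketch below shows one simple way such signals could be folded into a single scalar for RL: compute a score per signal and take a weighted sum. It is an illustrative assumption, not the paper's actual reward. The `Rollout` fields, the budget-based normalization, and the names `reward_components` and `scalar_reward` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Summary of one orchestration trajectory (hypothetical fields)."""
    correct: bool         # outcome signal, e.g. from an answer checker
    cost_usd: float       # total spend on tool and model calls
    latency_s: float      # wall-clock time for the trajectory
    preferred_calls: int  # calls made to tools the user prefers
    total_calls: int      # all tool and model calls


def reward_components(r: Rollout, cost_budget: float, latency_budget: float) -> list[float]:
    """Map a rollout to [outcome, cost, latency, preference] scores in [0, 1]."""
    outcome = 1.0 if r.correct else 0.0
    # Assumed normalization: cheaper and faster rollouts score closer to 1.
    cost_score = 1.0 - min(r.cost_usd / cost_budget, 1.0)
    latency_score = 1.0 - min(r.latency_s / latency_budget, 1.0)
    # Share of calls that went to user-preferred tools (0 if no calls were made).
    pref_score = r.preferred_calls / r.total_calls if r.total_calls else 0.0
    return [outcome, cost_score, latency_score, pref_score]


def scalar_reward(components: list[float], weights: list[float]) -> float:
    """Weighted sum of the component scores (a dot product with the weights)."""
    if len(components) != len(weights):
        raise ValueError("components and weights must have the same length")
    return sum(w * c for w, c in zip(weights, components))


rollout = Rollout(correct=True, cost_usd=0.5, latency_s=30.0,
                  preferred_calls=3, total_calls=4)
components = reward_components(rollout, cost_budget=1.0, latency_budget=60.0)
reward = scalar_reward(components, weights=[1.0, 0.25, 0.25, 0.5])
print(components, reward)
```

Changing the weight vector shifts the trade-off between the signals. For example, a user who cares mostly about cost could raise the second weight, which is the kind of steering the preference-aware reward is meant to support.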
Peer Reviews
Decision·Submitted to ICLR 2026
The work addresses a timely challenge: how to efficiently orchestrate diverse models and tools. The system is well motivated from both cognitive and engineering perspectives. The RL formulation is generally sound, with an explicit reward design that covers outcome, efficiency, and preference. The analysis in Section 6 is thorough and makes the results interpretable; the figures, especially Figure 4, give concrete insights into the orchestratio
The baselines currently lack diversity: there are recent works on rule-based or logic-based orchestrators, and many other router works that the paper could compare itself with. The reward and optimization details lack justification. There’s no ablation study on reward scaling, instability, etc. Also, how the different reward components are unified is under-explained, particularly how they are expressed in a common unit. The data creation process is very artificial and he
- Besides the more commonly studied aspect of reasoning accuracy, ToolOrchestra also draws attention to improving efficiency and user-preference alignment in agentic reasoning via RL, which are valuable for test-time scaling and generalization of agentic systems.
- The idea of using vectorized weights to balance agentic reasoning accuracy, efficiency, and tool-usage preference is interesting and effective.
- The experimental analysis is comprehensive, which covers four
- The reward function of ToolOrchestra’s RL approach relies on closed-source GPT-5 as a judge, which incurs a relatively large training cost in both money and latency. This reduces the efficiency value of the proposed method and also limits reproducibility. It would be better to compare the GPT-5 judge with more lightweight, open-source reward models and to mix in rule-based rewards on applicable tasks.
- The experimental study lacks comparisons to other RL-based tool integration
- The idea of orchestrating both tools and models is interesting. Previous works mainly focus on the model-routing or tool-selection problem but seldom consider the two together, which is quite new and has real-world application value.
- The reward design also considers user preference in tool use, which is also very novel and valuable, as it may give the user control to steer the model toward using certain tools rather than others. This user customization related train
- The reward design write-up in Section 3.2 is too crowded and introduces many symbols unnecessarily. For example, user preference alignment is just the reward’s projection onto the user preference vector, which can be stated clearly in one sentence.
- The problem should actually be a user-tool-model-orchestrator four-sided decision-making problem: the orchestrator needs to consider whether to align with the user, call a tool, call a model, or generate an answer itself. The paper can further benefit by putting itself int
Code & Models
- 🤗 nvidia/Nemotron-Orchestrator-8B · model · 7.0k dl · ♡ 562
- 🤗 Mungert/Orchestrator-8B-GGUF · model · 407 dl · ♡ 2
- 🤗 cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit · model · 2.0k dl · ♡ 2
- 🤗 cyankiwi/Nemotron-Orchestrator-8B-AWQ-8bit · model · 74 dl · ♡ 2
- 🤗 Mungert/Nemotron-Orchestrator-8B-GGUF · model · 168 dl · ♡ 1
- 🤗 ericlewis/Nemotron-Orchestrator-8B-NVFP4 · model · 34 dl
- 🤗 ryanfortin/community-blend-qwen3-8b · model · 7 dl
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning and Data Classification
