AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
Mingyang Song, Haoyu Sun, Jiawei Gu, Linjie Li, Luxin Xu, Ranjay Krishna, Yu Cheng

TL;DR
AdaReasoner is a multimodal model that learns to dynamically select and coordinate tools for visual reasoning, improving performance and generalization without explicit supervision.
Contribution
It introduces a scalable data pipeline, a reinforcement learning algorithm, and an adaptive mechanism for dynamic tool orchestration in multimodal reasoning models.
Findings
Achieves +24.9% improvement on average over baseline models.
Demonstrates strong tool-adaptive and generalization behaviors.
Outperforms proprietary systems like GPT-5 on multiple benchmarks.
Abstract
When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these…
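As a rough illustration of item (ii), the sketch below shows the group-relative advantage computation that GRPO-style methods are generally built on: several trajectories are sampled for the same prompt, each is scored, and each score is normalized against its own group. It is not the paper's Tool-GRPO implementation. The function names, the reward composition (end-task correctness plus a small format bonus), the `format_weight` value, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev


def shaped_reward(answer_correct: bool, format_ok: bool,
                  format_weight: float = 0.1) -> float:
    """Hypothetical reward: end-task success plus a small format bonus.

    The weighting is an assumption for illustration, not a value from the paper.
    """
    return float(answer_correct) + format_weight * float(format_ok)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each trajectory's reward against its own sampling group.

    Trajectories that beat the group mean get a positive advantage,
    the rest a negative one. If every reward is identical, all
    advantages are zero and the group carries no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled tool-use trajectories for one prompt (made-up outcomes).
group = [
    shaped_reward(answer_correct=True, format_ok=True),
    shaped_reward(answer_correct=False, format_ok=True),
    shaped_reward(answer_correct=False, format_ok=False),
    shaped_reward(answer_correct=True, format_ok=True),
]
advantages = group_relative_advantages(group)
```

Because the reward here depends only on the final outcome and output format, any preference for particular tools or call sequences would have to arise from which trajectories end up above the group mean, which is the sense in which such a setup optimizes tool use "based on end-task success".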
Peer Reviews
Decision·ICLR 2026 Poster
The paper addresses an important and timely challenge in multimodal AI: how to move beyond single-tool usage toward adaptive, multi-step tool coordination. The problem is clearly defined, the motivation is well grounded, and the proposed framework is logically structured. The design combining curated multi-turn trajectories with reinforcement learning represents a meaningful step toward more adaptive and interpretable reasoning systems. The writing is exceptionally clear and the presentation is
While the empirical results are impressive, the methodological contribution is incremental. The proposed Tool-GRPO is effectively an application of existing GRPO with customized reward shaping and formatting constraints. The novelty lies primarily in system integration and data engineering rather than in algorithmic or theoretical innovation. A second limitation is the heavy reliance on manual, task-specific design. The “abstract problem-solving blueprints” that underpin the Cold Start data are
- A key limitation of the "rule-based reward structure" in R1-style methods is that it primarily optimizes the reasoning process and fails to directly improve the model’s underlying perceptual capabilities. AdaReasoner directly addresses this shortcoming: by leveraging the precise perceptual capabilities of external expert models and specialized tools, it ensures high-fidelity understanding of visual inputs, thereby enhancing the reliability of the entire reasoning pipeline.
- Unlike previous me
The most notable weakness of this paper lies in the limitations of evaluating tool generalization ability, specifically the "oversimplified verification of new tools during inference" and "lack of adaptation to tool complexity". These limitations cast doubt on the generalizability of the research conclusions in more complex and diverse tool scenarios. See other Weaknesses in Questions.
The authors demonstrate that their method improves over a variety of baselines across several visual reasoning benchmarks, and they include sufficient ablation experiments. A common limitation of training-based approaches to tool-integrated reasoning is that they may not generalize to newly introduced tools. The authors address this by showing that, at inference time, adding an unseen tool (A*) improves performance.
The authors show that RL training allows the model to learn how much to use different tools ("adopt", "discard", "modulate") and call this an emergent behavior at multiple points throughout the paper. However, the authors' use of the term "emergent behavior" could benefit from clarification or a definition. Generally, emergent behaviors are non-obvious or surprising capabilities that are not explicitly optimized in the objective and that typically "emerge" only at scale. In this case, the method is deli
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Mobile Crowdsensing and Crowdsourcing
