Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

Shilin Yan; Jintao Tong; Hongwei Xue; Xiaojun Tang; Yangyang Wang; Kunyu Shi; Guannan Zhang; Ruixuan Li; Yixiong Zou

arXiv:2604.08545 · cs.CV · April 10, 2026

TL;DR

This paper introduces HDPO, a framework for agentic multimodal models that decouples the accuracy and efficiency objectives, improving reasoning accuracy while reducing unnecessary tool invocations.

Contribution

It proposes HDPO, an approach that optimizes accuracy and efficiency separately, enabling agents to better arbitrate between internal knowledge and external tool use.

Findings

01 Metis reduces tool invocations by orders of magnitude.

02 Metis achieves higher reasoning accuracy.

03 HDPO outperforms existing reinforcement learning protocols.

Abstract

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy…
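
The dilemma the abstract attributes to scalarized rewards can be illustrated with a toy calculation. The sketch below is not the paper's method or code: the reward form `accuracy - penalty * tool_calls`, the within-group standardization, and every name and number in it are assumptions chosen only for illustration.

```python
from statistics import mean, pstdev


def scalarized_rewards(correct, tool_calls, penalty):
    """Coupled reward (hypothetical form): accuracy minus a per-call tool penalty."""
    return [float(c) - penalty * t for c, t in zip(correct, tool_calls)]


def group_normalized(rewards):
    """Standardize rewards within one rollout group (an assumed baseline scheme)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Toy group of four rollouts for a single query (made-up values):
# correct/no tools, correct/3 calls, wrong/no tools, wrong/3 calls.
correct = [1, 1, 0, 0]
tool_calls = [0, 3, 0, 3]

mild = group_normalized(scalarized_rewards(correct, tool_calls, penalty=0.01))
aggressive = group_normalized(scalarized_rewards(correct, tool_calls, penalty=0.5))

print("mild penalty advantages:      ", [round(a, 3) for a in mild])
print("aggressive penalty advantages:", [round(a, 3) for a in aggressive])
```

In this toy group, the mild penalty separates the two correct rollouts by only a small fraction of the gap between correct and incorrect rollouts, so the efficiency signal is dominated by the accuracy term. Under the aggressive penalty, the correct rollout that used three tool calls scores below the incorrect rollout that used none, which would discourage tool use even when it is needed.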

Code & Models

Repositories

accio-lab/Metis (GitHub)
