LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Chung-En Sun; Linbo Liu; Ge Yan; Zimo Wang; Tsui-Wei Weng

arXiv:2605.09252·cs.CL·May 22, 2026

TL;DR

This paper introduces When2Tool, a benchmark for assessing when large language models (LLMs) actually need to call tools. It finds that models already know when a tool is necessary but often fail to act on that knowledge.

Contribution

It demonstrates that tool necessity can be linearly decoded from LLMs' hidden states and proposes Probe&Prefill to significantly reduce unnecessary tool calls.
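To make "linearly decoded" concrete, the following is a minimal sketch of a linear probe: a logistic-regression classifier trained on hidden-state vectors to predict a binary tool-necessary label, scored with AUROC. Everything here is an assumption for illustration: the hidden states are synthetic Gaussian vectors rather than real LLM activations, and the names `make_synthetic_states`, `fit_linear_probe`, and `auroc` are hypothetical, not taken from the paper or its repository.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, d, seed=0):
    """Stand-in for real hidden states (assumption): Gaussian noise plus a
    shift of +/-2 along one random unit direction, depending on the label."""
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction))
    direction = [v / norm for v in direction]
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = tool necessary, 0 = tool unnecessary
        sign = 2.0 if label == 1 else -2.0
        X.append([rng.gauss(0.0, 1.0) + sign * u for u in direction])
        y.append(label)
    return X, y


def fit_linear_probe(X, y, lr=0.1, steps=200):
    """Fit logistic regression with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


X, y = make_synthetic_states(n=300, d=8)
w, b = fit_linear_probe(X[:200], y[:200])          # train on the first 200 examples
held_out_scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X[200:]]
held_out_auroc = auroc(held_out_scores, y[200:])   # evaluate on the remaining 100
```

With real models, `X` would instead hold activations extracted from a chosen layer and token position, a choice this summary does not specify.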

Findings

01 Tool necessity can be linearly decoded from models' hidden states with high AUROC.

02 Probe&Prefill reduces tool calls by 48% with minimal accuracy loss.

03 Baseline methods either suppress necessary calls or incur high accuracy costs.
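This summary names Probe&Prefill but does not describe how it works. One plausible reading, stated here purely as an assumption, is that a probe score gates a short prefilled opening for the model's response, committing it to either a tool call or a direct answer. The sketch below illustrates only that gating logic; the prefill strings, the threshold of 0.5, and the function names are all hypothetical and may not match the authors' method.

```python
import math

# Hypothetical prefill strings (assumption); a real agent would use its own
# chat template and tool-call format.
PREFILL_CALL_TOOL = "I need a tool for this."
PREFILL_ANSWER_DIRECTLY = "I can answer this directly."


def probe_probability(hidden_state, weights, bias):
    """Apply a linear probe and squash the score to a probability."""
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def choose_prefill(hidden_state, weights, bias, threshold=0.5):
    """Pick the response prefix from the probe's tool-necessity estimate."""
    p = probe_probability(hidden_state, weights, bias)
    return PREFILL_CALL_TOOL if p >= threshold else PREFILL_ANSWER_DIRECTLY
```

Under this reading, raising `threshold` would trade fewer tool calls against a higher risk of skipping a necessary one, which is the trade-off the findings above describe for the baselines.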

Abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still…

Code & Models

Repositories

Trustworthy-ML-Lab/when2tool
github

Datasets

cesun/When2Tool
dataset · 248 downloads