LLM Agents Already Know When to Call Tools -- Even Without Reasoning
Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

TL;DR
This paper introduces When2Tool, a benchmark for assessing when large language models (LLMs) need to call tools, revealing that models already know when tools are necessary but often fail to act accordingly.
Contribution
It demonstrates that tool necessity can be linearly decoded from LLMs' hidden states and proposes Probe&Prefill to significantly reduce unnecessary tool calls.
Findings
Tool necessity can be linearly decoded from the models' hidden states with high AUROC.
Probe&Prefill reduces tool calls by 48% with minimal accuracy loss.
Baseline methods either suppress necessary calls or incur high accuracy costs.
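To make the first finding concrete, here is a minimal sketch of what "linearly decoding" a binary label from hidden states and scoring it with AUROC can look like. It is not the paper's code: the hidden states are synthetic Gaussian vectors, the "tool needed" labels are made up, and the helper names (make_synthetic_states, train_linear_probe, auroc) are hypothetical. A real probe would be fit on activations extracted from the model under study.

```python
import numpy as np


def make_synthetic_states(n=400, dim=32, seed=0):
    """Synthetic stand-in for hidden states (assumption, not real activations).

    Label 1 means "tool needed" (hypothetical). Positive examples are shifted
    along one random unit direction so the classes are linearly separable in
    expectation.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    states = rng.normal(size=(n, dim)) + 2.0 * labels[:, None] * direction
    return states, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label = 1)
        grad = p - y                             # gradient of log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as one half, which matches the usual AUROC definition.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


states, labels = make_synthetic_states()
split = 300  # first 300 examples train the probe, the rest are held out
w, b = train_linear_probe(states[:split], labels[:split])
test_scores = states[split:] @ w + b
probe_auroc = auroc(test_scores, labels[split:])
print(f"held-out AUROC: {probe_auroc:.3f}")
```

The probe is a single weight vector plus a bias, so a high held-out AUROC indicates that the label is readable along one direction of the representation. How Probe&Prefill turns such a signal into a tool-call decision is not described in the text above, so the sketch stops at the decoding step.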
Abstract
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still…
