TL;DR
This paper introduces WildToolBench, a benchmark based on real user behavior to evaluate LLMs' ability to handle complex, messy, and dynamic tool-use scenarios, revealing significant gaps in current models.
Contribution
The paper presents WildToolBench, the first benchmark grounded in real-world user interactions, highlighting the challenges LLMs face in practical tool-use tasks.
Findings
None of the 57 evaluated LLMs exceeds 15% accuracy on WildToolBench.
Existing benchmarks overlook the complexities of real user behavior.
Real challenges stem from wild user interactions, not task complexity.
Abstract
Fulfilling user needs through Large Language Model (LLM) multi-turn, multi-step tool use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behavior: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that requires contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a…