ToolTalk: Evaluating Tool-Usage in a Conversational Setting
Nicholas Farn, Richard Shin

TL;DR
ToolTalk is a benchmark designed to evaluate large language models' ability to perform complex, multi-step tool usage in conversational settings, highlighting current performance gaps and guiding future improvements.
Contribution
This paper introduces ToolTalk, a comprehensive benchmark with simulated tools for automated evaluation of LLMs' multi-step tool-using capabilities in dialogue.
Findings
GPT-3.5 achieves a 26% success rate on ToolTalk.
GPT-4 achieves a 50% success rate on ToolTalk.
Error analysis identifies the main categories of failure and suggests directions for future work.
Abstract
Large language models (LLMs) have displayed massive improvements in reasoning and decision-making skills and can hold natural conversations with users. Many recent works seek to augment LLM-based assistants with external tools so they can access private or up-to-date information and carry out actions on behalf of users. To better measure the performance of these assistants, this paper introduces ToolTalk, a benchmark consisting of complex user intents requiring multi-step tool usage specified through dialogue. ToolTalk contains 28 tools grouped into 7 plugins, and includes a complete simulated implementation of each tool, allowing for fully automated evaluation of assistants that rely on execution feedback. ToolTalk also emphasizes tools that externally affect the world rather than only tools for referencing or searching information. We evaluate GPT-3.5 and GPT-4 on ToolTalk resulting…
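To make the idea of a simulated tool with execution feedback concrete, here is a minimal sketch. It is not taken from the ToolTalk code base: the `SimulatedCalendar` plugin, its tool names, the `run_calls` helper, and the state-comparison check in `conversation_succeeds` are all hypothetical, and the benchmark's actual tools and success metric may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimulatedCalendar:
    """Hypothetical plugin with in-memory state (not an actual ToolTalk plugin)."""
    events: List[Dict[str, Any]] = field(default_factory=list)

    def create_event(self, title: str, date: str) -> Dict[str, Any]:
        # An "action" tool: it changes the simulated world state.
        if not title:
            return {"error": "title must be non-empty"}
        event = {"id": len(self.events), "title": title, "date": date}
        self.events.append(event)
        return {"result": event}

    def query_events(self, date: str) -> Dict[str, Any]:
        # A "reference" tool: it only reads state.
        return {"result": [e for e in self.events if e["date"] == date]}


def run_calls(plugin: SimulatedCalendar, calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Execute tool calls in order and collect the feedback an assistant would see."""
    feedback = []
    for call in calls:
        tool = getattr(plugin, call["tool"], None)
        if not callable(tool):
            feedback.append({"error": f"unknown tool: {call['tool']}"})
            continue
        try:
            feedback.append(tool(**call["args"]))
        except TypeError as exc:
            # Wrong or missing arguments are reported back instead of crashing.
            feedback.append({"error": str(exc)})
    return feedback


def conversation_succeeds(calls: List[Dict[str, Any]], expected_events: List[Dict[str, Any]]) -> bool:
    """Simplified check (an assumption): compare the final simulated state to a reference."""
    plugin = SimulatedCalendar()
    run_calls(plugin, calls)
    return plugin.events == expected_events


if __name__ == "__main__":
    calls = [
        {"tool": "create_event", "args": {"title": "Standup", "date": "2023-11-01"}},
        {"tool": "query_events", "args": {"date": "2023-11-01"}},
    ]
    print(run_calls(SimulatedCalendar(), calls))
```

Because every tool runs against in-memory state, a predicted sequence of calls can be replayed and checked without human judgment, which is the property that makes fully automated evaluation possible.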
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · AI in Service Interactions
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Dense Connections · Dropout · Softmax · Absolute Position Encodings
