ToolQA: A Dataset for LLM Question Answering with External Tools
Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, Chao Zhang

TL;DR
ToolQA is a dataset for evaluating how well large language models use external tools to answer questions. It addresses a gap in earlier evaluations, which do not separate questions answerable from a model's internal knowledge from those that require tool use.
Contribution
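The paper introduces ToolQA, a dataset built through a scalable, automated curation process and paired with 13 specialized tools for interacting with external knowledge, to faithfully assess LLMs' ability to use external tools for question answering.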
Findings
Sets a new benchmark for LLM tool-use evaluation
Highlights strengths and weaknesses of current tool-use LLMs
Provides insights for future improvements in LLMs
Abstract
Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
