ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios

Junjie Ye; Guanyu Li; Songyang Gao; Caishuang Huang; Yilong Wu; Sixian Li; Xiaoran Fan; Shihan Dou; Tao Ji; Qi Zhang; Tao Gui; Xuanjing Huang

arXiv:2401.00741 · cs.CL · December 6, 2024 · 1 citation

TL;DR

ToolEyes introduces a detailed evaluation system for assessing large language models' ability to learn and effectively use tools in real-world scenarios, addressing limitations of previous outcome-focused methods.

Contribution

The paper presents a fine-grained evaluation framework that analyzes five dimensions of tool learning across seven real-world scenarios, supported by a library of approximately 600 tools.

Findings

1. LLMs show a preference for certain scenarios.
2. Limited cognitive abilities hinder tool learning.
3. Larger models may perform worse in tool learning.

Abstract

Existing evaluations of tool learning primarily focus on validating the alignment of selected tools for large language models (LLMs) with expected outcomes. However, these approaches rely on a limited set of scenarios where answers can be pre-determined, diverging from genuine needs. Furthermore, a sole emphasis on outcomes disregards the complex capabilities required for LLMs to effectively use tools. To tackle this issue, we propose ToolEyes, a fine-grained system tailored for the evaluation of the LLMs' tool learning capabilities in authentic scenarios. The system meticulously examines seven real-world scenarios, analyzing five dimensions crucial to LLMs in tool learning: format alignment, intent comprehension, behavior planning, tool selection, and answer organization. Additionally, ToolEyes incorporates a tool library boasting approximately 600 tools, serving as an intermediary…

Code & Models

Repositories

junjie-ye/tooleyes (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research

Methods: Sparse Evolutionary Training · Lib · Focus