SMARTCAL: An Approach to Self-Aware Tool-Use Evaluation and Calibration

Yuanhao Shen; Xiaodan Zhu; Lei Chen

arXiv:2412.12151·cs.LG·December 18, 2024

TL;DR

This paper investigates the self-awareness and calibration issues of large language models in tool use, revealing overconfidence and misuse, and proposes SMARTCAL to improve performance and calibration accuracy.

Contribution

The paper introduces SMARTCAL, a novel approach to enhance LLMs' self-awareness and calibration in tool use, addressing a critical trustworthiness challenge.

Findings

1. 8.6% average increase in QA performance

2. 21.6% decrease in calibration error

3. Tool-abuse behavior is common across models, regardless of model capability
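This page does not define how calibration error is measured. For reference, a widely used calibration metric is binned Expected Calibration Error (ECE). Predictions are grouped into confidence bins, and ECE is the sample-weighted average gap between each bin's accuracy and its mean confidence. The Python sketch below assumes equal-width bins over [0, 1], and the function name `expected_calibration_error` is hypothetical. It illustrates the metric in general and is not taken from the SMARTCAL repository.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Binned ECE with equal-width, left-closed bins over [0, 1] (illustrative sketch)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # Bin k covers [k/n_bins, (k+1)/n_bins); a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's |accuracy - confidence| gap by its share of all samples.
        ece += (len(bucket) / n) * abs(accuracy - mean_conf)
    return ece
```

For example, `expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, False, True, True])` returns 0.4. The 0.9 bin has accuracy 0.5 and the 0.6 bin has accuracy 1.0, so each half of the data contributes a 0.4 gap. Bin boundaries and bin counts vary between implementations, so ECE values are only comparable when computed the same way.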

Abstract

The tool-use ability of Large Language Models (LLMs) has a profound impact on a wide range of industrial applications. However, LLMs' self-control and calibration capabilities in using tools appropriately remain understudied. The problem is consequential because it raises potential risks of degraded performance and poses a threat to the trustworthiness of the models. In this paper, we conduct a study on a family of state-of-the-art LLMs on three datasets with two mainstream tool-use frameworks. Our study reveals the tool-abuse behavior of LLMs, a tendency for models to misuse tools with overconfidence. We also find that this is a common issue regardless of model capability. Accordingly, we propose a novel approach, SMARTCAL, to mitigate the observed issues, and our results show an average 8.6 percent increase in QA performance and a 21.6 percent decrease in Expected…

Code & Models

Repositories

henrysyh2000/smartcal (official)

Taxonomy

Topics: Business Process Modeling and Analysis · Scientific Computing and Data Management · Semantic Web and Ontologies