Uncertainty Quantification for LLM Function-Calling
Zihuiwen Ye, Lukas Aichberger, Michael Kirchhof, Sinead Williamson, Luca Zappella, Yarin Gal, Arno Blaas, Adam Golinski

TL;DR
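This paper evaluates uncertainty quantification (UQ) methods for large language model function-calling (FC) and shows that two FC-specific adaptations, clustering sampled outputs by syntax and restricting logit-based scores to semantically meaningful tokens, can improve confidence estimates before a call is executed.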
Contribution
To the authors' knowledge, the first evaluation of UQ methods for LLM function-calling, demonstrating how output clustering and token selection improve uncertainty estimation performance.
Findings
Multi-sample UQ methods perform well on natural language Q&A, but that advantage does not necessarily carry over to FC.
Clustering FC outputs by syntax improves multi-sample UQ performance.
Selecting semantically meaningful tokens enhances logit-based UQ accuracy.
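The last two findings can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes function calls are emitted as JSON strings, treats "same syntax" as "same canonical JSON", and uses a hypothetical `keep` predicate (tokens containing an alphanumeric character) to stand in for "semantically meaningful" tokens. The function names `canonical_call`, `cluster_entropy`, and `selected_token_nll` are illustrative only.

```python
import json
import math
from collections import Counter
from typing import Callable, Sequence


def canonical_call(raw: str) -> str:
    """Normalize a JSON function call so key order and whitespace do not matter.

    Assumption: calls are JSON strings. Unparseable outputs are kept verbatim,
    so they only cluster with exact duplicates.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return raw.strip()
    return json.dumps(call, sort_keys=True, separators=(",", ":"))


def cluster_entropy(samples: Sequence[str]) -> float:
    """Entropy (in nats) over clusters of sampled calls with the same canonical form."""
    n = len(samples)
    counts = Counter(canonical_call(s) for s in samples)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())


def selected_token_nll(
    tokens: Sequence[str],
    logprobs: Sequence[float],
    keep: Callable[[str], bool],
) -> float:
    """Mean negative log-probability over the tokens accepted by `keep`."""
    picked = [lp for tok, lp in zip(tokens, logprobs) if keep(tok)]
    if not picked:
        return 0.0
    return -sum(picked) / len(picked)


# Three sampled calls: the first two differ only in formatting and key order.
samples = [
    '{"name": "transfer", "arguments": {"to": "bob", "amount": 5}}',
    '{"arguments":{"amount":5,"to":"bob"},"name":"transfer"}',
    '{"name": "transfer", "arguments": {"to": "bob", "amount": 50}}',
]
multi_sample_score = cluster_entropy(samples)

# Hypothetical tokenization of one call, with made-up log-probabilities.
tokens = ['{"', "name", '":', "transfer", "}"]
logprobs = [-0.01, -0.02, -0.01, -1.2, -0.01]
has_content = lambda tok: any(ch.isalnum() for ch in tok)  # drops pure punctuation
single_sample_score = selected_token_nll(tokens, logprobs, has_content)
```

In this sketch, higher scores mean lower confidence. A caller could compare either score against a threshold tuned on held-out data before executing an irreversible call.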
Abstract
Large Language Models (LLMs) are increasingly deployed to autonomously solve real-world tasks. A key ingredient for this is the LLM Function-Calling paradigm, a widely used approach for equipping LLMs with tool-use capabilities. However, an LLM calling functions incorrectly can have severe implications, especially when their effects are irreversible, e.g., transferring money or deleting data. Hence, it is of paramount importance to consider the LLM's confidence that a function call solves the task correctly prior to executing it. Uncertainty Quantification (UQ) methods can be used to quantify this confidence and prevent potentially incorrect function calls. In this work, we present what is, to our knowledge, the first evaluation of UQ methods for LLM Function-Calling (FC). While multi-sample UQ methods, such as Semantic Entropy, show strong performance for natural language Q&A tasks, we…
