CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions
Tamer Alkhouli, Katerina Margatina, James Gung, Raphael Shu, Claudia Zaghi, Monica Sunkara, Yi Zhang

TL;DR
CONFETTI is a comprehensive benchmark for evaluating large language models' ability to perform function calling in complex, turn-based conversations, revealing strengths and limitations in handling APIs, context length, and chained calls.
Contribution
This paper introduces CONFETTI, a novel benchmark with diverse conversational scenarios and API interactions to assess LLMs' function-calling capabilities in realistic settings.
Findings
Some models handle long conversations well
Models can leverage over 20 APIs successfully
Performance on chained function-calls is limited
Abstract
We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous and implicit goals. We use this benchmark to perform off-policy turn-level evaluation of function calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available…
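As a rough illustration of what off-policy turn-level evaluation can look like, the sketch below scores a model one user turn at a time, always conditioning on the reference conversation history rather than on the model's own earlier outputs. Everything here is an assumption made for illustration: the `Turn` and `FunctionCall` types, the `make_call` helper, the `predict` callback, and the exact-match scoring rule are hypothetical and are not taken from the paper, which may use a different matching criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCall:
    """A hypothetical, hashable representation of one function call."""
    name: str
    arguments: tuple  # sorted (key, value) pairs; values assumed hashable


def make_call(name, **kwargs):
    """Build a FunctionCall with arguments in a canonical (sorted) order."""
    return FunctionCall(name, tuple(sorted(kwargs.items())))


@dataclass
class Turn:
    """One evaluation point: reference history plus the expected calls."""
    history: list          # reference conversation up to the current user turn
    reference_calls: list  # FunctionCall objects expected at this turn


def evaluate_turns(turns, predict):
    """Return the fraction of turns whose predicted calls match the reference.

    `predict` maps a reference history to a list of FunctionCall objects.
    Exact set equality is an assumed scoring rule, chosen for simplicity.
    """
    if not turns:
        return 0.0
    correct = 0
    for turn in turns:
        # Off-policy: the model always sees the reference history,
        # so an error at one turn does not propagate to later turns.
        predicted = predict(turn.history)
        if set(predicted) == set(turn.reference_calls):
            correct += 1
    return correct / len(turns)
```

Because every turn is scored against the same fixed history, results from different models stay directly comparable, at the cost of not measuring how a model recovers from its own earlier mistakes.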
