CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions

Tamer Alkhouli; Katerina Margatina; James Gung; Raphael Shu; Claudia Zaghi; Monica Sunkara; Yi Zhang

arXiv:2506.01859·cs.CL·June 3, 2025

TL;DR

CONFETTI is a comprehensive benchmark for evaluating large language models' ability to perform function calling in complex, turn-based conversations, revealing strengths and limitations in handling APIs, context length, and chained calls.

Contribution

This paper introduces CONFETTI, a novel benchmark with diverse conversational scenarios and API interactions to assess LLMs' function-calling capabilities in realistic settings.

Findings

01. Some models handle long conversations well.

02. Models can successfully leverage more than 20 APIs.

03. Performance on chained function calls is limited.

Abstract

We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark, targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available…
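The off-policy turn-level setup mentioned in the abstract can be pictured with a small sketch. It is not the authors' evaluation code. It assumes that each evaluated turn supplies the reference conversation history (rather than the model's own earlier outputs) and one annotated reference call, and that a prediction counts as correct only when the function name and all arguments match exactly. The names `FunctionCall`, `Turn`, `make_call`, `turn_level_accuracy`, and the toy API names are hypothetical, and the paper's actual matching criterion may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionCall:
    """A function call compared by value (hypothetical representation)."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so argument order is irrelevant


def make_call(name, **arguments):
    """Build a FunctionCall with its arguments in a canonical order."""
    return FunctionCall(name, tuple(sorted(arguments.items())))


@dataclass
class Turn:
    """One evaluated user turn."""
    history: list           # reference conversation up to and including this user turn
    expected: FunctionCall  # annotated reference call for this turn


def turn_level_accuracy(turns, predict):
    """Fraction of turns where the predicted call equals the reference call.

    Off-policy: `predict` always sees the reference history, never its own
    earlier predictions. Exact match is an assumption of this sketch.
    """
    if not turns:
        return 0.0
    correct = sum(1 for turn in turns if predict(turn.history) == turn.expected)
    return correct / len(turns)


# Toy data: two turns taken from one made-up conversation.
turns = [
    Turn(
        history=["user: book a table for two"],
        expected=make_call("book_table", party_size=2),
    ),
    Turn(
        history=[
            "user: book a table for two",
            "agent: done, your table is booked",
            "user: actually, make it four",
        ],
        expected=make_call("update_booking", party_size=4),
    ),
]


def toy_policy(history):
    """Stand-in for an LLM: always issues the same call."""
    return make_call("book_table", party_size=2)


score = turn_level_accuracy(turns, toy_policy)
print(score)
```

Because every turn is scored against the reference history, a mistake on one turn does not propagate to later turns, which is what makes per-turn comparison across models straightforward in a setup like this.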
