DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Kyochul Jang; Donghyeon Lee; Kyusik Kim; Dongseok Heo; Taewhoo Lee; Woojeong Kim; Bongwon Suh

arXiv:2506.22853·cs.CL·July 3, 2025

TL;DR

This paper introduces DICE-BENCH, a framework and dataset for evaluating how well large language models use tools in multi-round, multi-party dialogues, a setting that existing single-turn function-calling benchmarks overlook.

Contribution

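It presents DICE-SCORE, a metric that measures how widely tool-related information (such as function names and parameter values) is dispersed across a dialogue, and DICE-BENCH, a framework for constructing realistic multi-round, multi-party function-calling dialogues. Together they expose current limitations of LLMs.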

Findings

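01. Existing benchmarks receive notably low DICE-SCOREs: their tool-related information is barely dispersed across the dialogue, so they fall short of realistic scenarios.

02. The DICE-BENCH dataset contains 1,607 realistic dialogue instances with high DICE-SCOREs.

03. Experiments on 19 LLMs indicate that significant improvements are needed before LLMs can handle real-world tool use effectively.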

Abstract

Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant…

Code & Models

Datasets

OfficerChul/DICE-BENCH
dataset · 478 downloads

Taxonomy

Topics: Topic Modeling · Persona Design and Applications · AI in Service Interactions