FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Sunghee Jung, Myeongcheol Shin

TL;DR
This paper introduces FunctionChat-Bench, a benchmark for evaluating language models' generative capabilities in Korean tool-use dialogs, and highlights the gap between single-turn Tool Call accuracy and multi-turn conversational performance.
Contribution
The study presents a new benchmark with automated evaluation for assessing language models' tool-use dialog capabilities, emphasizing multi-turn performance and conversational engagement.
Findings
High accuracy in single-turn Tool Call tasks does not ensure multi-turn success.
Models need to generate engaging conversational messages beyond tool calls.
FunctionChat-Bench enables detailed evaluation of language models' dialog generation abilities.
Abstract
This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages: a model must also effectively generate conversational messages that engage the user.
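The four output types named in the abstract can be read as a decision about what the model should say next at a given point in the dialog. The sketch below is illustrative only and is not taken from the benchmark's code: the function `expected_output_type`, the tool dictionary layout, and the decision order are all assumptions, as is the interpretation of each type (Relevance Detection when no tool fits the request, Slot Question when a required argument is missing, Tool Call when all required arguments are known, Answer Completion once a tool result is available).

```python
from enum import Enum
from typing import Optional


class OutputType(Enum):
    # The four output types named in the abstract.
    TOOL_CALL = "Tool Call"
    ANSWER_COMPLETION = "Answer Completion"
    SLOT_QUESTION = "Slot Question"
    RELEVANCE_DETECTION = "Relevance Detection"


def expected_output_type(
    matched_tool: Optional[dict],
    known_args: dict,
    tool_result: Optional[str],
) -> OutputType:
    """Hypothetical helper: pick the output type a dialog state calls for.

    The decision order below is an assumption made for illustration,
    not the benchmark's own logic.
    """
    # No available tool fits the user's request.
    if matched_tool is None:
        return OutputType.RELEVANCE_DETECTION
    # The tool has already run; the model should answer from its result.
    if tool_result is not None:
        return OutputType.ANSWER_COMPLETION
    # A required argument is still unknown; the model should ask for it.
    missing = [p for p in matched_tool["required"] if p not in known_args]
    if missing:
        return OutputType.SLOT_QUESTION
    # Everything needed is known; the model should emit the tool call.
    return OutputType.TOOL_CALL


# Hypothetical tool definition, loosely following common function-calling schemas.
weather_tool = {"name": "get_weather", "required": ["city"]}

print(expected_output_type(None, {}, None).value)
print(expected_output_type(weather_tool, {}, None).value)
print(expected_output_type(weather_tool, {"city": "Seoul"}, None).value)
print(expected_output_type(weather_tool, {"city": "Seoul"}, "18C, clear").value)
```

Only one of the four branches produces a tool call message; the other three produce conversational text, which is consistent with the abstract's point that single-turn Tool Call accuracy alone does not capture multi-turn performance.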
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques
