FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs

Shinbok Lee; Gaeun Seo; Daniel Lee; Byeongil Ko; Sunghee Jung; Myeongcheol Shin

arXiv:2411.14054·cs.CL·November 22, 2024

TL;DR

This paper introduces FunctionChat-Bench, a benchmark for evaluating language models' generative capabilities in Korean tool-use dialogs. It highlights that high accuracy on single-turn tool calls does not necessarily carry over to multi-turn conversational performance.

Contribution

The study presents a new benchmark with automated evaluation for assessing language models' tool-use dialog capabilities, emphasizing multi-turn performance and conversational engagement.

Findings

01. High accuracy in single-turn Tool Call tasks does not ensure multi-turn success.

02. Models need to generate engaging conversational messages beyond tool calls.

03. FunctionChat-Bench enables detailed evaluation of language models' dialog generation abilities.

Abstract

This study investigates language models' generative capabilities in tool-use dialogs. We categorize the models' outputs in tool-use dialogs into four distinct types: Tool Call, Answer Completion, Slot Question, and Relevance Detection, which serve as aspects for evaluation. We introduce FunctionChat-Bench, comprising 700 evaluation items and automated assessment programs. Using this benchmark, we evaluate several language models that support function calling. Our findings indicate that while language models may exhibit high accuracy in single-turn Tool Call scenarios, this does not necessarily translate to superior generative performance in multi-turn environments. We argue that the capabilities required for function calling extend beyond generating tool call messages; they must also effectively generate conversational messages that engage the user.
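To make the four output types concrete, here is a minimal Python sketch. The abstract does not say how the benchmark's automated assessment programs work, so everything beyond the four category names is an assumption made for illustration: the `ToolCall` dataclass, the JSON message shape with `name` and `arguments` keys, and the exact-match rule in `tool_call_matches` are all hypothetical and are not taken from the paper or the `kakao/functionchat-bench` repository.

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OutputType(Enum):
    # The four output categories named in the abstract.
    TOOL_CALL = "Tool Call"
    ANSWER_COMPLETION = "Answer Completion"
    SLOT_QUESTION = "Slot Question"
    RELEVANCE_DETECTION = "Relevance Detection"


@dataclass
class ToolCall:
    # Hypothetical representation of a tool call: a function name plus arguments.
    name: str
    arguments: dict


def parse_tool_call(raw: str) -> Optional[ToolCall]:
    """Parse an assumed JSON tool-call message; return None if it is not one."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Plain conversational text (e.g. a slot question) is not valid JSON.
        return None
    if not isinstance(data, dict) or "name" not in data:
        return None
    arguments = data.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return ToolCall(name=data["name"], arguments=arguments)


def tool_call_matches(raw_output: str, expected: ToolCall) -> bool:
    """Illustrative exact-match check: same function name and same arguments."""
    predicted = parse_tool_call(raw_output)
    return predicted is not None and predicted == expected


if __name__ == "__main__":
    expected = ToolCall(name="get_weather", arguments={"location": "Seoul"})
    good = '{"name": "get_weather", "arguments": {"location": "Seoul"}}'
    print(tool_call_matches(good, expected))
    print(tool_call_matches("어느 도시의 날씨를 알려드릴까요?", expected))
```

An exact-match rule like this can only score the Tool Call category. The other three categories produce free-form conversational text, so they would need a different kind of judgment, which is consistent with the abstract's point that function calling involves more than emitting tool call messages.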

Code & Models

Repositories

kakao/functionchat-bench (Official)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques