MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

Ge Bai; Jie Liu; Xingyuan Bu; Yancheng He; Jiaheng Liu; Zhanhui Zhou; Zhuoran Lin; Wenbo Su; Tiezheng Ge; Bo Zheng; Wanli Ouyang

arXiv:2402.14762·cs.CL·November 6, 2024·2 cites

TL;DR

MT-Bench-101 introduces a detailed, hierarchical benchmark for evaluating the nuanced multi-turn dialogue capabilities of large language models, addressing gaps left by previous coarse assessments.

Contribution

It presents a new fine-grained, multi-level evaluation framework and dataset for assessing LLMs in complex multi-turn dialogues, with comprehensive analysis across models and tasks.

Findings

01. Model performance follows different trends across dialogue turns, depending on the task.

02. Neither common alignment techniques nor chat-specific designs lead to obvious improvements in multi-turn abilities.

03. Case studies suggest that the benchmark's tasks accurately assess the corresponding multi-turn dialogue abilities.

Abstract

The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing…
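
To make the "ability and task perspectives" analysis more concrete, here is a minimal sketch of how per-turn scores could be averaged by task and turn index. The record layout (`TurnScore`), the task labels, and every score below are hypothetical and invented for illustration; they are not the benchmark's data format or results. Only the counts 4208 and 1388 come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class TurnScore:
    task: str         # hypothetical task label, not an actual MT-Bench-101 task name
    dialogue_id: int  # hypothetical dialogue identifier
    turn: int         # 1-based turn index within the dialogue
    score: float      # hypothetical per-turn score


def mean_by_task_and_turn(records):
    """Average the scores for each (task, turn) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r.task, r.turn)].append(r.score)
    return {key: mean(vals) for key, vals in buckets.items()}


# Toy records, invented purely for illustration (not benchmark results).
toy = [
    TurnScore("task_a", 1, 1, 8.0),
    TurnScore("task_a", 1, 2, 6.0),
    TurnScore("task_a", 2, 1, 9.0),
    TurnScore("task_a", 2, 2, 7.0),
    TurnScore("task_b", 3, 1, 5.0),
]

per_turn = mean_by_task_and_turn(toy)

# Dataset size quoted in the abstract: 4208 turns across 1388 dialogues.
avg_turns_per_dialogue = 4208 / 1388
```

Grouping by the `(task, turn)` pair rather than by task alone keeps turn-level trends visible, which is the kind of view the first finding refers to. The counts in the abstract work out to roughly three turns per dialogue on average.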

Code & Models

Repositories

mtbench101/mt-bench-101
PyTorch · Official

Videos

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues · Underline

Taxonomy

Topics: Topic Modeling