MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Ved Sirdeshmukh; Kaustubh Deshpande; Johannes Mols; Lifeng Jin; Ed-Yeremai Cardona; Dean Lee; Jeremy Kritz; Willow Primack; Summer Yue; Chen Xing

arXiv:2501.17399·cs.CL·March 7, 2025·2 cites

TL;DR

MultiChallenge is a new benchmark designed to evaluate large language models on complex, multi-turn conversations involving realistic challenges, revealing significant gaps in current models' capabilities despite high scores on existing benchmarks.

Contribution

This paper introduces MultiChallenge, a comprehensive benchmark with novel challenges and an automatic evaluation method, highlighting the limitations of current frontier LLMs in multi-turn conversations.

Findings

01. All frontier LLMs perform below 50% accuracy on MultiChallenge.

02. Existing benchmarks overestimate models' multi-turn conversation abilities.

03. The top model achieves only 41.4% accuracy on the new benchmark.

Abstract

We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge approach with instance-level rubrics, which yields an automatic evaluation method that has fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (June…
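The abstract's evaluation idea (a judge applying a rubric written for each individual conversation, with accuracy as the fraction of conversations that pass) might be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Instance` fields, the yes/no rubric format, the `accuracy` helper, and the toy `keyword_judge` that stands in for an LLM judge are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Instance:
    """One test conversation (hypothetical schema, not the paper's data format)."""
    conversation: list[str]  # prior turns, oldest first
    rubric: str              # yes/no question written for this instance only
    passing_answer: str      # the judge verdict ("yes" or "no") that counts as a pass


# A judge maps (final model response, rubric) to a "yes"/"no" verdict.
# In practice this would be an LLM call; here it is just a callable.
Judge = Callable[[str, str], str]


def accuracy(instances: Sequence[Instance], responses: Sequence[str], judge: Judge) -> float:
    """Fraction of instances whose judge verdict matches the passing answer."""
    if len(instances) != len(responses):
        raise ValueError("need exactly one response per instance")
    if not instances:
        return 0.0
    passed = sum(
        judge(resp, inst.rubric).strip().lower() == inst.passing_answer
        for inst, resp in zip(instances, responses)
    )
    return passed / len(instances)


def keyword_judge(response: str, rubric: str) -> str:
    """Toy stand-in for an LLM judge: answers "yes" if the response mentions peanuts."""
    return "yes" if "peanut" in response.lower() else "no"


# Both conversations state a peanut allergy early on; a response passes only
# if the judge answers "no" to the rubric question.
instances = [
    Instance(["I'm allergic to peanuts.", "Suggest a dinner."],
             "Does the response recommend a dish containing peanuts?", "no"),
    Instance(["I'm allergic to peanuts.", "Suggest a soup."],
             "Does the response recommend a dish containing peanuts?", "no"),
]
responses = ["Try a peanut satay.", "Try a lentil soup."]
score = accuracy(instances, responses, keyword_judge)
```

Writing the rubric per instance keeps the judge's task narrow: it answers one concrete question about one response instead of rating overall conversation quality.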

Code & Models

Repositories

ekwinox117/multi-challenge (Official)

Videos

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs

Taxonomy

Topics: Semantic Web and Ontologies · Service-Oriented Architecture and Web Services · Business Process Modeling and Analysis