ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

Sichun Luo; Yi Huang; Mukai Li; Shichang Meng; Fengyuan Liu; Zefa Hu; Junlan Feng; Qi Liu

arXiv:2512.21120·cs.CL·December 25, 2025

Open Access

TL;DR

This paper introduces ClarifyMT-Bench, a comprehensive benchmark for multi-turn clarification in conversational LLMs, revealing their tendency to under-clarify and proposing ClarifyAgent to improve clarification behavior.

Contribution

The paper presents a new multi-turn clarification benchmark grounded in a five-dimensional ambiguity taxonomy and six behaviorally diverse simulated user personas, along with ClarifyAgent, an approach for improving clarification behavior in LLMs.

Findings

01 LLMs tend to answer prematurely in multi-turn dialogues.

02 Performance degrades as dialogue depth increases.

03 ClarifyAgent significantly improves clarification robustness.

Abstract

Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate…

Taxonomy

Topics: Topic Modeling · AI in Service Interactions · Artificial Intelligence in Healthcare and Education