ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models
Sichun Luo, Yi Huang, Mukai Li, Shichang Meng, Fengyuan Liu, Zefa Hu, Junlan Feng, Qi Liu

TL;DR
This paper introduces ClarifyMT-Bench, a comprehensive benchmark for multi-turn clarification in conversational LLMs, revealing their tendency to under-clarify and proposing ClarifyAgent to improve clarification behavior.
Contribution
It presents a new multi-turn clarification benchmark grounded in a detailed ambiguity taxonomy and diverse simulated user personas, along with a novel ClarifyAgent approach to enhance clarification in LLMs.
Findings
LLMs tend to answer prematurely in multi-turn dialogues.
Performance degrades as dialogue depth increases.
ClarifyAgent significantly improves clarification robustness.
Abstract
Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate…
Taxonomy
Topics: Topic Modeling · AI in Service Interactions · Artificial Intelligence in Healthcare and Education
