TL;DR
This paper conducts a comprehensive meta-evaluation of various metrics for assessing conversational search systems, revealing their limitations and proposing session-based metrics for better multi-turn evaluation.
Contribution
It provides the most extensive meta-evaluation of conversational search metrics, analyzing their reliability, fidelity, and intuitiveness, and introduces session-based evaluation approaches.
Findings
Existing metrics show weak correlation with user satisfaction.
METEOR performs best among single-turn metrics.
Session-based metrics moderately align with user satisfaction.
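As a rough illustration of what "correlation with user satisfaction" can mean in practice, the sketch below computes Kendall's tau-a between per-response metric scores and user satisfaction ratings. This is a generic rank-correlation example, not the paper's actual procedure; the function name `kendall_tau` and the sample numbers are hypothetical.

```python
from itertools import combinations


def kendall_tau(metric_scores, satisfaction):
    """Kendall's tau-a between two equal-length lists of scores.

    Tied pairs count as neither concordant nor discordant.
    Hypothetical helper for illustration only.
    """
    n = len(metric_scores)
    if n != len(satisfaction):
        raise ValueError("both lists must have the same length")
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: the metric and the users rank the pair the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            satisfaction[i] - satisfaction[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up example: four responses scored by a metric and rated by users (1-5).
metric = [0.2, 0.5, 0.4, 0.9]
ratings = [1, 3, 4, 5]
tau = kendall_tau(metric, ratings)  # 5 concordant pairs, 1 discordant pair
```

A value near 1 means the metric orders responses the way users do, a value near 0 means the two orderings are unrelated, and a value near -1 means they are reversed.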
Abstract
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems in multiple rounds through natural language dialogues. Evaluating such systems is very challenging because any natural language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies proposed many evaluation metrics, the extent to which those measures capture user preference remains to be investigated. In this paper, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the…
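The reliability perspective, telling "actual" performance differences apart from those observed by chance, is commonly checked with a significance test over paired scores. The sketch below uses a paired randomization (sign-flip) test, which is an assumption made for illustration: the abstract does not say which test the paper uses. The function name `paired_randomization_test`, its parameters, and the sample scores are hypothetical.

```python
import random


def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the mean score difference.

    Returns an estimated p-value: the share of random sign flips whose
    absolute mean difference is at least as large as the observed one.
    Hypothetical helper for illustration only.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same items")
    if not scores_a:
        raise ValueError("need at least one paired score")

    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each per-item difference may have its sign flipped at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / n >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Made-up metric scores for two systems on the same ten conversations.
system_a = [0.71, 0.65, 0.80, 0.77, 0.69, 0.74, 0.82, 0.68, 0.79, 0.73]
system_b = [0.45, 0.40, 0.52, 0.49, 0.41, 0.47, 0.55, 0.39, 0.50, 0.44]
p_value = paired_randomization_test(system_a, system_b)
```

Under this framing, a metric that yields small p-values for more system pairs at a fixed significance level is treated as having more discriminative power.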
