Meta-evaluation of Conversational Search Evaluation Metrics

Zeyang Liu; Ke Zhou; Max L. Wilson

arXiv:2104.13453·cs.IR·April 29, 2021

TL;DR

This paper conducts a comprehensive meta-evaluation of various metrics for assessing conversational search systems, revealing their limitations and proposing session-based metrics for better multi-turn evaluation.

Contribution

The paper provides an extensive meta-evaluation of conversational search metrics, analyzing their reliability, fidelity, and intuitiveness, and introduces session-based evaluation approaches for multi-turn conversations.

Findings

01 Existing metrics show weak correlation with user satisfaction.

02 METEOR performs best among single-turn metrics.

03 Session-based metrics moderately align with user satisfaction.

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is challenging because any natural language response could be generated, and users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures capture user preference remains to be investigated. In this paper, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the…
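The reliability and fidelity perspectives can be made concrete with a small sketch. This is not the paper's procedure: it assumes fidelity is approximated by a rank correlation (Kendall's tau-a) between a metric's scores and user satisfaction ratings, and reliability by a paired sign-flip randomization test on per-conversation scores from two systems. The function names `kendall_tau_a` and `paired_randomization_p` are hypothetical, and any scores passed in would be illustrative rather than taken from the paper.

```python
import random
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    Used here as an assumed proxy for fidelity (metric vs. user ratings).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def paired_randomization_p(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization (sign-flip) test on the mean difference.

    Returns the fraction of random sign assignments whose absolute mean
    difference is at least as large as the observed one. A small value
    suggests the observed difference is unlikely to arise by chance.
    Used here as an assumed proxy for reliability.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(trials):
        # Randomly swap which system each per-conversation score belongs to.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed:
            hits += 1
    return hits / trials
```

Under these assumptions, a metric with higher tau against satisfaction ratings would count as having higher fidelity, and a metric that yields small p-values for more system pairs would count as more discriminative. Tau-a ignores ties, which matters when satisfaction is rated on a coarse scale; a tie-corrected variant may be preferable in that case.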

Code & Models

Repositories

poethan/LEPOR