On Evaluating and Comparing Open Domain Dialog Systems

Anu Venkatesh; Chandra Khatri; Ashwin Ram; Fenfei Guo; Raefer Gabriel; Ashish Nagar; Rohit Prasad; Ming Cheng; Behnam Hedayatnia; Angeliki Metallinou; Rahul Goel; Shaohua Yang; Anirudh Raju

arXiv:1801.03625·cs.CL·December 31, 2018·24 cites

TL;DR

This paper presents a comprehensive, multi-metric evaluation strategy for open domain dialog systems, aiming to reduce subjectivity and better correlate with human judgments, based on data from the Alexa Prize competition.

Contribution

It introduces a novel multi-metric evaluation framework that provides granular analysis and a unified scoring mechanism for conversational agents, leveraging large-scale real-world data.

Findings

01 Metrics correlate well with human judgment

02 Proposed evaluation reduces subjectivity in assessments

03 Framework applied successfully in Alexa Prize competition

Abstract

Conversational agents are exploding in popularity. However, much work remains in the area of non-goal-oriented conversations, despite significant growth in research interest over recent years. To advance the state of the art in conversational AI, Amazon launched the Alexa Prize, a 2.5-million-dollar university competition where sixteen selected university teams built conversational agents to deliver the best social conversational experience. The Alexa Prize provided the academic community with the unique opportunity to perform research with a live system used by millions of users. The subjectivity associated with evaluating conversations is a key element underlying the challenge of building non-goal-oriented dialogue systems. In this paper, we propose a comprehensive evaluation strategy with multiple metrics designed to reduce subjectivity by selecting metrics which correlate well with human…

Taxonomy

Topics: Topic Modeling · AI in Service Interactions · Speech and dialogue systems