Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols
Sarah E. Finch, Jinho D. Choi

TL;DR
This paper surveys the automated, static, and interactive evaluation protocols used in 20 recent dialogue system papers, identifies their limitations, and draws on that survey and an expert assessment to propose a more unified evaluation framework that supports fair comparison across systems.
Contribution
The paper provides a systematic review of automated, static, and interactive evaluation methods, identifies gaps among them, and suggests a unified evaluation approach for dialogue systems.
Findings
Current protocols have significant shortcomings in assessing dialogue quality.
Automated and human evaluations often lack consistency and comprehensiveness.
Expert evaluation reveals key dimensions missing in existing protocols.
Abstract
As conversational AI-based dialogue management has increasingly become a trending topic, the need for a standardized and reliable evaluation procedure grows even more pressing. The current state of affairs suggests various evaluation protocols to assess chat-oriented dialogue management systems, rendering it difficult to conduct fair comparative studies across different approaches and gain an insightful understanding of their values. To foster this research, a more robust evaluation protocol must be set in place. This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions. A total of 20 papers from the last two years are surveyed to analyze three types of evaluation protocols: automated, static, and interactive. Finally, the…
