Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Haoming Jiang; Bo Dai; Mengjiao Yang; Tuo Zhao; Wei Wei

arXiv:2102.10242 · cs.CL · September 23, 2021

Open Access · 1 Repo

TL;DR

This paper introduces ENIGMA, a model-free, off-policy evaluation framework that accurately estimates human evaluation scores for dialogue systems using limited pre-collected data, bypassing the need for human interaction during testing.

Contribution

ENIGMA is the first model-free, off-policy evaluation method for dialogue systems that requires only a small amount of pre-collected data and does not depend on behavior policies.

Findings

01 ENIGMA outperforms existing automatic evaluation methods in correlation with human scores.

02 It requires only limited pre-collected experience data, making large-scale evaluation feasible.

03 ENIGMA is agnostic to behavior policies, simplifying modeling of complex dialogue environments.
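
Finding 01 is stated in terms of correlation with human scores. As a rough illustration of what such a comparison involves (this is not the paper's evaluation code), the sketch below computes a Pearson correlation between an automatic metric's scores and human ratings for the same set of dialogue systems. Pearson is only one common choice; the summary above does not say which coefficient the paper reports. The function name `pearson` and all score values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one entry per dialogue system being evaluated.
human_scores = [2.1, 3.4, 3.9, 4.5]       # assumed average human ratings
metric_scores = [0.30, 0.52, 0.55, 0.71]  # assumed automatic metric estimates

print(f"Pearson r = {pearson(metric_scores, human_scores):.3f}")
```

A value near 1 would mean the automatic metric ranks and spaces the systems much as human raters do; a value near 0 would mean it carries little information about human judgment.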

Abstract

Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target…

Code & Models

Repositories

google-research/google-research · tf · Official

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Reinforcement Learning in Robotics

Methods: ENIGMA