Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach
Haoming Jiang, Bo Dai, Mengjiao Yang, Tuo Zhao, Wei Wei

TL;DR
This paper introduces ENIGMA, a model-free, off-policy evaluation framework that accurately estimates human evaluation scores for dialogue systems using limited pre-collected data, bypassing the need for human interaction during testing.
Contribution
ENIGMA is the first model-free, off-policy evaluation method for dialogue systems that requires only a small amount of pre-collected data and does not depend on behavior policies.
Findings
ENIGMA outperforms existing automatic evaluation methods in correlation with human scores.
It requires only limited pre-collected experience data, making large-scale evaluation feasible.
ENIGMA is agnostic to behavior policies, simplifying modeling of complex dialogue environments.
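The first finding is stated in terms of correlation with human scores. As a rough illustration of what that comparison involves, the sketch below computes a Pearson correlation between human ratings and an automatic metric's estimates. The helper name `pearson` and all numbers are hypothetical toy values made up for this example, not results from the paper, and the paper's own choice of correlation statistic may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one entry per dialogue system being evaluated.
human_scores = [2.1, 3.4, 3.9, 4.5]       # made-up average human ratings
metric_scores = [0.30, 0.42, 0.55, 0.61]  # made-up automatic estimates

r = pearson(human_scores, metric_scores)
print(f"Pearson r = {r:.3f}")
```

A value of r near 1 would mean the automatic metric ranks and spaces the systems much as human raters do; a value near 0 would correspond to the weak correlation the abstract attributes to earlier metrics.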
Abstract
Reliable automatic evaluation of dialogue systems in an interactive environment is long overdue. An ideal environment for evaluating dialogue systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics from language generation tasks (e.g., perplexity, BLEU) or model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods show only a very weak correlation with actual human evaluation in practice. To bridge this gap, we propose a new framework named ENIGMA for estimating human evaluation scores, based on recent advances in off-policy evaluation in reinforcement learning. ENIGMA requires only a handful of pre-collected experience data, and therefore does not involve human interaction with the target…
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Reinforcement Learning in Robotics
Methods: ENIGMA
