TL;DR
This paper examines how naturalistic variation in user behavior affects the performance of goal-oriented dialog systems, revealing significant drops in accuracy of current models when tested on more realistic, varied conversations.
Contribution
It introduces new testbeds with naturalistic user variation for two datasets and demonstrates the substantial impact on existing neural dialog system performance.
Findings
Performance drops by more than 60% in entity F1 (Ent. F1) on SMD when naturalistic variation is introduced.
Performance drops by more than 85% in per-dialog accuracy on bAbI when naturalistic variation is introduced.
Current state-of-the-art models are not robust to naturalistic conversational variation.
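The two metrics named above can be sketched as follows. This is a minimal illustration under common definitions, not the paper's evaluation code: per-dialog accuracy is assumed to count a dialog as correct only when every system response in it matches the reference, and entity F1 is assumed to be micro-averaged over the entities in predicted and reference responses. The function names and input shapes are hypothetical.

```python
from typing import List, Set, Tuple


def per_dialog_accuracy(dialogs: List[List[Tuple[str, str]]]) -> float:
    """Fraction of dialogs in which every (predicted, gold) response pair matches.

    Assumed definition: one wrong turn makes the whole dialog count as wrong.
    """
    if not dialogs:
        return 0.0
    correct = sum(all(pred == gold for pred, gold in turns) for turns in dialogs)
    return correct / len(dialogs)


def entity_f1(pairs: List[Tuple[Set[str], Set[str]]]) -> float:
    """Micro-averaged F1 over (predicted entities, gold entities) per response.

    Assumed definition: counts are pooled across all responses before
    precision and recall are computed.
    """
    tp = fp = fn = 0
    for pred, gold in pairs:
        tp += len(pred & gold)   # entities correctly produced
        fp += len(pred - gold)   # entities produced but not in the reference
        fn += len(gold - pred)   # reference entities that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the second dialog has one wrong turn.
toy_dialogs = [
    [("hello", "hello"), ("api_call a b", "api_call a b")],
    [("hello", "hello"), ("api_call a c", "api_call a b")],
]
toy_entities = [({"pizza", "5pm"}, {"pizza", "6pm"})]

dialog_acc = per_dialog_accuracy(toy_dialogs)
ent_f1 = entity_f1(toy_entities)
```

Because per-dialog accuracy is all-or-nothing, a single unexpected user turn that derails one response is enough to mark the whole dialog as failed, which is one reason this metric is so sensitive to added variation.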
Abstract
Existing benchmarks used to evaluate the performance of end-to-end neural dialog systems lack a key component: natural variation present in human conversations. Most datasets are constructed through crowdsourcing, where the crowd workers follow a fixed template of instructions while enacting the role of a user/agent. This results in straightforward, somewhat routine, and mostly trouble-free conversations, as crowd workers do not think to represent the full range of actions that occur naturally with real users. In this work, we investigate the impact of naturalistic variation on two goal-oriented datasets: the bAbI dialog task and the Stanford Multi-Domain Dataset (SMD). We also propose new and more effective testbeds for both datasets, by introducing naturalistic variation by the user. We observe that there is a significant drop in performance (more than 60% in Ent. F1 on SMD and 85% in…
