How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics
Prasanna Parthasarathi, Joelle Pineau, Sarath Chandar

TL;DR
This paper proposes using specially designed probing tasks to evaluate dialogue systems' understanding, addressing limitations of existing token-level metrics and human evaluations, and reveals that transformer models may not truly understand input despite high output similarity.
Contribution
The paper introduces a probing-task framework for evaluating dialogue models' understanding, pairing the determinism of automatic metrics with hand-crafted, human-designed tasks.
Findings
Probing tasks reveal transformer models may lack genuine input comprehension.
Automatic token-level metrics do not fully capture model understanding.
Human evaluation can be inconclusive due to insufficient information.
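To make the second finding concrete, here is a minimal sketch of a token-level overlap score. It is an illustration only, not a metric taken from the paper: `unigram_f1` and the example sentences are hypothetical, and whitespace tokenization is a simplifying assumption. It shows how a response that contradicts the reference can still score high, while a reasonable paraphrase scores zero.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (illustrative, not the paper's metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical examples: one contradicts the reference, one paraphrases it.
reference = "the restaurant is open on sunday"
contradiction = "the restaurant is not open on sunday"
paraphrase = "yes you can visit it this weekend"

print(round(unigram_f1(contradiction, reference), 3))  # 0.923
print(unigram_f1(paraphrase, reference))               # 0.0
```

The contradiction differs from the reference by a single token, so the overlap score stays high even though the meaning is reversed.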
Abstract
Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent. Such metrics were earlier shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow, and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding…
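As a rough sketch of what a probing task can look like in general, the example below fits a simple classifier on frozen representations and reports how well a property of the input can be recovered from them. Everything here is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in for a trained dialogue model's encoder, the "is this utterance a question" label is a made-up probe task, and the nearest-centroid classifier is chosen only to keep the code dependency-free. None of it reproduces the paper's actual tasks or models.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]
Example = Tuple[str, str]  # (utterance, probe label)


def toy_encoder(utterance: str, dim: int = 16) -> Vector:
    """Hypothetical frozen encoder: a deterministic bag-of-tokens hash."""
    vec = [0.0] * dim
    for tok in utterance.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec


def fit_centroids(encode: Encoder, examples: Sequence[Example]) -> Dict[str, Vector]:
    """Average the frozen representations per label; the encoder is never updated."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for text, label in examples:
        vec = encode(text)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(encode: Encoder, centroids: Dict[str, Vector], text: str) -> str:
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    vec = encode(text)

    def sq_dist(centroid: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def probe_accuracy(encode: Encoder, train: Sequence[Example], test: Sequence[Example]) -> float:
    """Probe score: fraction of held-out examples whose label is recovered."""
    centroids = fit_centroids(encode, train)
    correct = sum(predict(encode, centroids, text) == label for text, label in test)
    return correct / len(test)


# Made-up probe data for illustration only.
train = [("how are you ?", "question"), ("where do you live ?", "question"),
         ("i am fine", "statement"), ("i live in montreal", "statement")]
test = [("where is it ?", "question"), ("it is here", "statement")]

print(probe_accuracy(toy_encoder, train, test))
```

Because the score comes from a fixed classifier on fixed data, it is deterministic like a token-level metric, while the choice of label encodes a human judgement about what the model ought to have understood.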
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
