How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

Prasanna Parthasarathi; Joelle Pineau; Sarath Chandar

arXiv:2008.10427 · cs.CL · August 25, 2020 · 5 cites

TL;DR

This paper proposes specially designed probing tasks for evaluating how well dialogue systems understand their input, addressing the limitations of both token-level metrics and human evaluation. The probes suggest that transformer models may not truly understand their input even when their output closely resembles the reference.

Contribution

It introduces a probing-task framework for evaluating dialogue models' understanding, pairing the determinism of automatic metrics with hand-crafted tasks that target specific aspects of comprehension.
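
As a rough illustration of what a probing task can look like in general, the sketch below fits a simple classifier on fixed feature vectors and reports its accuracy on held-out examples. Everything here is an assumption made for illustration: the nearest-centroid classifier, the function names (`fit_centroids`, `predict`, `probe_accuracy`), the topic labels, and the two-dimensional vectors, which are synthetic stand-ins for frozen encoder representations. It is not the paper's implementation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(features: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Average the feature vectors of each label (a nearest-centroid probe)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Vector) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def probe_accuracy(centroids: Dict[str, List[float]],
                   features: List[Vector], labels: List[str]) -> float:
    """Fraction of held-out examples for which the probe recovers the label."""
    correct = sum(predict(centroids, v) == y for v, y in zip(features, labels))
    return correct / len(labels)


# Synthetic stand-ins for frozen encoder outputs (hypothetical data).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = ["weather", "weather", "travel", "travel"]
test_x = [[0.7, 0.3], [0.3, 0.7]]
test_y = ["weather", "travel"]

centroids = fit_centroids(train_x, train_y)
accuracy = probe_accuracy(centroids, test_x, test_y)
print(f"probe accuracy: {accuracy:.2f}")
```

The probe is kept deliberately simple so that high accuracy can be attributed to information already present in the representations rather than to the capacity of the probe itself.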

Findings

01 Probing tasks reveal transformer models may lack genuine input comprehension.

02 Automatic token-level metrics do not fully capture model understanding.

03 Human evaluation can be inconclusive due to insufficient information.

Abstract

Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry on a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent. Such metrics were earlier shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow, and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding…
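
To make the "deterministic yet shallow" point concrete, the sketch below computes a unigram-overlap F1 score between a reference and two candidate responses. The metric is a generic token-level score chosen for illustration, not necessarily one used in the paper. The function name `unigram_f1` and the example sentences are assumptions.

```python
from collections import Counter


def unigram_f1(reference: str, hypothesis: str) -> float:
    """F1 over overlapping unigrams, using whitespace tokenization."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref) & Counter(hyp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical responses: one contradicts the reference, one paraphrases it.
reference = "the table is booked for monday"
contradiction = "the table is not booked for monday"
paraphrase = "your reservation on monday is confirmed"

score_contradiction = unigram_f1(reference, contradiction)
score_paraphrase = unigram_f1(reference, paraphrase)

print(f"contradiction: {score_contradiction:.3f}")
print(f"paraphrase:    {score_paraphrase:.3f}")
```

The contradicting response differs from the reference by a single token and therefore scores far higher than the paraphrase, even though only the paraphrase preserves the meaning.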

Code & Models

Repositories

ppartha03/Dialogue-Probe-Tasks-Public
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems