deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

Michel Galley; Chris Brockett; Alessandro Sordoni; Yangfeng Ji; Michael Auli; Chris Quirk; Margaret Mitchell; Jianfeng Gao; Bill Dolan

arXiv:1506.06863 · cs.CL · June 25, 2015 · 93 cites

Open Access

TL;DR

deltaBLEU is an evaluation metric for generated text in tasks that admit many valid outputs. It weights each reference by a human quality rating on a scale of [-1, +1], and it correlates better with human judgments than sentence-level and IBM BLEU.

Contribution

The paper introduces deltaBLEU, a discriminative metric that incorporates human-rated reference quality to improve evaluation of diverse text generation tasks.

Findings

01

deltaBLEU correlates better with human judgments than traditional BLEU.

02

On conversational response generation, it correlates reasonably with human judgments.

03

It outperforms sentence-level and IBM BLEU on both Spearman's rho and Kendall's tau.

Abstract

We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
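
The weighting idea in the abstract can be illustrated with a minimal Python sketch of a rating-weighted n-gram precision for a single hypothesis. This is an assumption-laden illustration, not the authors' implementation: the function names `ngrams` and `weighted_precision` are hypothetical, the normalization by the best available reference weight is an assumed design choice, and the sketch leaves out the combination of several n-gram orders and the brevity penalty that a full BLEU-style score would include.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(
    hypothesis: Sequence[str],
    references: List[Tuple[Sequence[str], float]],
    n: int,
) -> float:
    """Rating-weighted n-gram precision (hypothetical helper).

    Each reference is a (tokens, weight) pair, with weight assumed to be a
    human quality rating in [-1, +1].
    """
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = [(ngrams(tokens, n), weight) for tokens, weight in references]
    if not hyp_counts or not ref_counts:
        return 0.0

    # Assumption: normalize by the best weight any reference could contribute.
    best_weight = max(weight for _, weight in ref_counts)

    numerator = 0.0
    denominator = 0.0
    for gram, hyp_count in hyp_counts.items():
        # Only references that actually contain this n-gram can credit it.
        matches = [(counts[gram], weight)
                   for counts, weight in ref_counts if gram in counts]
        if matches:
            # Clip the count as in BLEU, then scale by the reference rating
            # and keep the most favourable reference.
            numerator += max(weight * min(hyp_count, ref_count)
                             for ref_count, weight in matches)
        denominator += best_weight * hyp_count

    return numerator / denominator if denominator > 0 else 0.0


# Toy data (invented for illustration): one well-rated and one poorly rated reference.
toy_references = [
    ("i am fine".split(), 1.0),
    ("i am tired".split(), -0.5),
]
```

With these toy references, `weighted_precision("i am tired".split(), toy_references, 1)` gives 0.5, whereas unweighted unigram precision would be 1.0: the word "tired" matches only the negatively rated reference, so it lowers the score instead of raising it.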

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems