deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets
Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, Bill Dolan

TL;DR
deltaBLEU is a new evaluation metric for generated text that accounts for diverse outputs by weighting references with human quality scores, showing better correlation with human judgments than traditional BLEU.
Contribution
The paper introduces deltaBLEU, a discriminative metric that incorporates human-rated reference quality to improve evaluation of diverse text generation tasks.
Findings
deltaBLEU correlates better with human judgments than traditional BLEU.
It is evaluated on conversational response generation, where it correlates reasonably well with human judgments.
It outperforms sentence-level BLEU and IBM BLEU on both Spearman's rho and Kendall's tau.
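The two rank correlations named above can be illustrated with a minimal sketch. It assumes there are no tied scores (so it computes the tie-free form of Spearman's rho and Kendall's tau-a), and the function names and sample numbers are hypothetical, not taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


# Made-up metric scores and human ratings for four system outputs.
metric_scores = [0.1, 0.4, 0.3, 0.9]
human_ratings = [1, 3, 4, 5]
print(spearman_rho(metric_scores, human_ratings))
print(kendall_tau(metric_scores, human_ratings))
```

A metric that ranks outputs more like the human raters do yields values closer to 1 on both measures.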
Abstract
We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrinsic evaluation of generated text in tasks that admit a diverse range of possible outputs. Reference strings are scored for quality by human raters on a scale of [-1, +1] to weight multi-reference BLEU. In tasks involving generation of conversational responses, deltaBLEU correlates reasonably with human judgments and outperforms sentence-level and IBM BLEU in terms of both Spearman's rho and Kendall's tau.
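The idea of weighting multi-reference BLEU by human quality scores can be sketched as a weighted n-gram precision. The abstract does not give the formula, so the details here are assumptions: each hypothesis n-gram is credited with the best weighted clipped match among the references that contain it, and the result is normalized by the best achievable weighted count. The function names and example sentences are hypothetical, and the brevity penalty and the combination across n-gram orders are omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def weighted_precision(hypothesis, references, n):
    """Weighted n-gram precision for one hypothesis.

    references is a list of (tokens, weight) pairs, with each weight
    a human quality score in [-1, +1].
    """
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = [(ngrams(tokens, n), weight) for tokens, weight in references]
    numerator = 0.0
    denominator = 0.0
    for gram, count in hyp_counts.items():
        # Weighted clipped matches, only for references containing the n-gram.
        matches = [w * min(count, rc[gram]) for rc, w in ref_counts if gram in rc]
        if matches:
            numerator += max(matches)
        # Best weighted count this n-gram could have earned.
        denominator += max(w * count for _, w in ref_counts)
    return numerator / denominator if denominator > 0 else 0.0


hypothesis = "i like it too".split()
references = [
    ("i like it".split(), 1.0),       # rated as a good response
    ("i hate it too".split(), -0.5),  # rated as a poor response
]
print(weighted_precision(hypothesis, references, 1))
print(weighted_precision(hypothesis, references, 2))
```

In this toy case the unigram precision is 0.625 and the bigram precision is 0.5: the word "too" appears only in the negatively weighted reference, so matching it lowers the score instead of raising it. With all weights set to 1, the function reduces to ordinary clipped multi-reference precision.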
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
