Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Qingyu Lu; Liang Ding; Liping Xie; Kanjian Zhang; Derek F. Wong; Dacheng Tao

arXiv:2212.10179·cs.CL·December 21, 2022

TL;DR

This paper enhances automatic evaluation metrics for natural language generation by incorporating error analysis, making them more human-like and improving their correlation with human judgments across various NLG tasks.

Contribution

The paper introduces BARTScore++, an improved metric that combines major and minor error analysis to better mimic human evaluation in NLG tasks.

Findings

1. BARTScore++ outperforms existing metrics in 20 out of 25 test settings.

2. Incorporating error analysis improves correlation with human judgments.

3. The method can be extended to other pre-trained model-based metrics.

Abstract

State-of-the-art language model-based automatic metrics, e.g. BARTScore, benefit from large-scale contextualized pre-training and have been successfully used in a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the final goal of evaluation metrics (human-like evaluation) through automatic error analysis. To this end, we augment BARTScore with human-like error analysis strategies; in the resulting metric, BARTScore++, the final score consists of evaluations of both major errors and minor errors. Experimental results show that BARTScore++ can consistently improve the performance of vanilla BARTScore and…
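The abstract says only that the final score "consists of both the evaluations of major errors and minor errors"; it does not give the formula. As a purely illustrative sketch, the snippet below assumes the two evaluations are combined as a weighted sum. The names `ErrorScores` and `combine_error_scores`, the weights, and the numeric values are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class ErrorScores:
    """Hypothetical container for the two error evaluations of one hypothesis."""
    major: float  # evaluation of major errors (e.g. mistranslated tokens)
    minor: float  # evaluation of minor errors (e.g. imperfections in fluency)


def combine_error_scores(
    scores: ErrorScores,
    major_weight: float = 1.0,
    minor_weight: float = 1.0,
) -> float:
    """Combine the two evaluations into a single score.

    ASSUMPTION: a weighted sum. The source text only states that both
    evaluations contribute to the final score, not how they are combined.
    """
    return major_weight * scores.major + minor_weight * scores.minor


# Made-up values on a log-likelihood-like scale (higher is better).
example = ErrorScores(major=-2.0, minor=-0.5)
final_score = combine_error_scores(example, major_weight=2.0, minor_weight=1.0)
print(final_score)
```

Weighting major errors more heavily than minor ones, as in the usage line, is one plausible way to reflect how a human annotator might penalize a mistranslation more than a fluency slip; the actual weighting scheme should be taken from the paper itself.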

Code & Models

Repositories

coldmist-lu/erroranalysis_prompt

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Test