Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
Qingyu Lu, Liang Ding, Liping Xie, Kanjian Zhang, Derek F. Wong,, Dacheng Tao

TL;DR
This paper enhances automatic evaluation metrics for natural language generation by incorporating error analysis, making them more human-like and improving their correlation with human judgments across various NLG tasks.
Contribution
The paper introduces BARTScore++, an improved metric that combines major and minor error analysis to better mimic human evaluation in NLG tasks.
Findings
BARTScore++ outperforms existing metrics in 20 out of 25 test settings.
Incorporating error analysis improves correlation with human judgments.
The method is extendable to other pre-trained model-based metrics.
Abstract
The state-of-the-art language model-based automatic metrics, e.g. BARTScore, benefiting from large-scale contextualized pre-training, have been successfully used in a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis. To this end, we augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors. Experimental results show that BARTScore++ can consistently improve the performance of vanilla BARTScore and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
MethodsTest
