Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation
Xianfeng Zeng, Yijin Liu, Fandong Meng, Jie Zhou

TL;DR
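This paper proposes scoring NLG outputs against multiple references so that n-gram matching metrics such as BLEU and chrF correlate better with human judgments and mitigate data leakage issues in large language models. On the WMT Metrics benchmarks, multi-reference F200spBLEU is 7.2% more accurate than its single-reference counterpart.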
Contribution
It introduces a multi-reference approach for NLG evaluation metrics, demonstrating improved correlation with human evaluations and mitigating data leakage in LLMs.
Findings
On the WMT Metrics benchmarks, multi-reference F200spBLEU outperforms single-reference F200spBLEU by 7.2% in accuracy.
Multi-reference F200spBLEU also exceeds the neural-based BERTscore by 3.9% in accuracy.
Multi-reference metrics help reduce data leakage in large language models.
Abstract
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. However, recent studies have revealed a weak correlation between these matching-based metrics and human evaluations, especially when compared with neural-based metrics like BLEURT. In this paper, we conjecture that the performance bottleneck in matching-based metrics may be caused by the limited diversity of references. To address this issue, we propose to utilize multiple references to enhance the consistency between these metrics and human evaluations. Within the WMT Metrics benchmarks, we observe that the multi-references F200spBLEU surpasses the conventional single-reference one by an accuracy improvement of 7.2%. Remarkably, it also exceeds the neural-based BERTscore by an accuracy enhancement of 3.9%. Moreover, we observe that…
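To make the multi-reference idea concrete, here is a minimal sketch of a sentence-level BLEU-style score that accepts several references. It is not the paper's F200spBLEU: the whitespace tokenization, the add-one smoothing, and the names `ngrams` and `multi_ref_bleu` are assumptions made for illustration only. Each hypothesis n-gram count is clipped by the highest count found in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU-style score against one or more references.

    Whitespace tokenization and add-one smoothing are illustrative
    assumptions, not the paper's F200spBLEU setup.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # For each n-gram, keep the highest count seen in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        # Clip hypothesis counts so repeated n-grams are not over-rewarded.
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the log defined when nothing matches.
        log_precisions.append(math.log((clipped + 1) / (total + 1)))

    # Brevity penalty: use the reference length closest to the hypothesis
    # length, preferring the shorter one on ties.
    hyp_len = len(hyp)
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - hyp_len), l))
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(sum(log_precisions) / max_n)


hypothesis = "the cat sat on the mat"
one_reference = ["a cat was sitting on the mat"]
two_references = one_reference + ["the cat sat on the mat"]

single_score = multi_ref_bleu(hypothesis, one_reference)
multi_score = multi_ref_bleu(hypothesis, two_references)
print(f"single-reference: {single_score:.3f}")
print(f"multi-reference:  {multi_score:.3f}")
```

In this toy case the hypothesis is a valid output that the single reference happens to phrase differently, so the second reference raises its score. That is the limited-diversity effect the abstract conjectures about, shown on one sentence rather than on a benchmark.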
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
