Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

Xianfeng Zeng; Yijin Liu; Fandong Meng; Jie Zhou

arXiv:2308.03131·cs.CL·August 11, 2023·1 cites


Open Access 1 Repo

TL;DR

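This paper proposes using multiple references in n-gram matching-based NLG evaluation metrics, such as BLEU and chrF, to improve their agreement with human judgments and to mitigate the data leakage issue in large language models (LLMs). On the WMT Metrics benchmarks, multi-reference F200spBLEU improves accuracy by 7.2% over its single-reference version and by 3.9% over the neural-based BERTscore.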

Contribution

The paper introduces a multi-reference approach for matching-based NLG evaluation metrics, showing that it improves their consistency with human evaluations and mitigates the data leakage issue in LLMs.

Findings

01. On the WMT Metrics benchmarks, multi-reference F200spBLEU outperforms single-reference F200spBLEU by 7.2% accuracy.

02. Multi-reference F200spBLEU also exceeds the neural-based BERTscore by 3.9% accuracy.

03. Multi-reference metrics help mitigate the data leakage issue in large language models (LLMs).

Abstract

N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. However, recent studies have revealed a weak correlation between these matching-based metrics and human evaluations, especially when compared with neural-based metrics like BLEURT. In this paper, we conjecture that the performance bottleneck in matching-based metrics may be caused by the limited diversity of references. To address this issue, we propose to utilize multiple references to enhance the consistency between these metrics and human evaluations. Within the WMT Metrics benchmarks, we observe that the multi-reference F200spBLEU surpasses the conventional single-reference one by an accuracy improvement of 7.2%. Remarkably, it also exceeds the neural-based BERTscore by an accuracy enhancement of 3.9%. Moreover, we observe that…
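
To make the idea of scoring against multiple references concrete, here is a minimal sketch of sentence-level BLEU with more than one reference, following the standard BLEU definition: each hypothesis n-gram count is clipped by its maximum count in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length. It is not the authors' F200spBLEU implementation. Whitespace tokenization, the absence of smoothing, and the names `ngrams` and `multi_ref_bleu` are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU against one or more references (no smoothing).

    Assumption: whitespace tokenization, purely for illustration.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            # Without smoothing, a zero precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: use the reference length closest to the hypothesis
    # length, breaking ties in favor of the shorter reference.
    closest = min((len(r) for r in refs), key=lambda m: (abs(m - len(hyp)), m))
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    hyp = "the cat is sitting on the mat"
    ref_a = "a cat sits on a rug"
    ref_b = "the cat is sitting on the mat"
    print("single reference:", multi_ref_bleu(hyp, [ref_a]))
    print("two references:  ", multi_ref_bleu(hyp, [ref_a, ref_b]))
```

In this toy case the hypothesis is a valid rendering that shares almost no n-grams with the first reference, so a single-reference score penalizes it heavily; adding a second, differently worded reference lets the matching-based metric recognize it.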


Code & Models

Repositories

sefazeng/llm-ref (Official)


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification