Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation

Jie Ruan; Wenqing Wang; Xiaojun Wan

arXiv:2406.07935 · cs.CL · June 13, 2024

TL;DR

This paper highlights the vulnerabilities in human evaluation guidelines for NLG systems, introduces a dataset and detection method for these vulnerabilities, and offers recommendations to improve evaluation reliability.

Contribution

It presents the first dataset of human evaluation guidelines, a taxonomy of vulnerabilities, and a method using LLMs to detect these vulnerabilities, advancing reliable NLG evaluation.

Findings

01

77.09% of the released guidelines contain vulnerabilities

02

Only 29.84% of recent top-conference papers involving human evaluation release their evaluation guidelines

03

Proposed detection method effectively identifies vulnerabilities
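
The two percentages above can be combined to estimate how papers split overall. The sketch below is only illustrative arithmetic on the figures quoted here: it assumes the 77.09% applies to the released guidelines (as the abstract states) and, as a simplification not taken from the paper, that each releasing paper contributes exactly one guideline. The function name `combined_shares` is hypothetical.

```python
# Figures quoted on this page.
RELEASE_RATE = 0.2984     # share of papers that release their guidelines
VULNERABLE_RATE = 0.7709  # share of released guidelines with vulnerabilities


def combined_shares(release_rate: float, vulnerable_rate: float) -> tuple[float, float, float]:
    """Split all papers into three groups, as fractions of the total.

    Assumption (not from the paper): one guideline per releasing paper,
    so guideline-level and paper-level rates can be multiplied directly.
    """
    released_vulnerable = release_rate * vulnerable_rate
    released_clean = release_rate * (1.0 - vulnerable_rate)
    not_released = 1.0 - release_rate
    return released_vulnerable, released_clean, not_released


labels = (
    "released, vulnerabilities identified",
    "released, no vulnerability identified",
    "guideline not released",
)
shares = combined_shares(RELEASE_RATE, VULNERABLE_RATE)
for label, share in zip(labels, shares):
    print(f"{label}: {share:.2%}")
```

Under these assumptions, roughly 23% of the surveyed papers release a guideline in which a vulnerability was identified, and under 7% release one with none identified.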

Abstract

Human evaluation serves as the gold standard for assessing the quality of Natural Language Generation (NLG) systems. Nevertheless, the evaluation guideline, as a pivotal element ensuring reliable and reproducible human assessment, has received limited attention. Our investigation revealed that only 29.84% of recent papers involving human evaluation at top conferences release their evaluation guidelines, with vulnerabilities identified in 77.09% of these guidelines. Unreliable evaluation guidelines can yield inaccurate assessment outcomes, potentially impeding the advancement of NLG in the right direction. To address these challenges, we take an initial step towards reliable evaluation guidelines and propose the first human evaluation guideline dataset by collecting annotations of guidelines extracted from existing papers as well as generated via Large Language Models (LLMs). We then…

Code & Models

Repositories

EnablerRx/GuidelineVulnDetect
none · Official

Videos

Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation · underline

Taxonomy

Topics: Healthcare Systems and Practices · Health, Medicine and Society

Methods: Sparse Evolutionary Training