ReIFE: Re-evaluating Instruction-Following Evaluation

Yixin Liu; Kejian Shi; Alexander R. Fabbri; Yilun Zhao; Peifeng Wang; Chien-Sheng Wu; Shafiq Joty; Arman Cohan

arXiv:2410.07069·cs.CL·October 10, 2024

TL;DR

This paper meta-evaluates LLM-based instruction-following evaluators across 25 base LLMs, 15 evaluation protocols, and 4 human-annotated datasets, identifying which base LLMs and protocols are most accurate and how robust those conclusions are.

Contribution

It introduces ReIFE, a large-scale meta-evaluation suite for assessing instruction-following evaluators, and provides systematic analysis of evaluation accuracy across various conditions.

Findings

01. Base LLM performance ranking is consistent across protocols.

02. Robust evaluation requires diverse base LLMs and multiple datasets.

03. Evaluation results vary significantly depending on datasets and LLM capabilities.
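
One way to make the first finding concrete is to compare how two protocols rank the same set of base LLMs. The sketch below is illustrative only: it assumes each protocol yields one accuracy score per base LLM, uses Spearman's rank correlation as the consistency measure (the listing above does not say which statistic the paper uses), and assumes no tied scores. The model names and numbers are made up.

```python
from typing import Dict, Mapping


def ranks(scores: Mapping[str, float]) -> Dict[str, int]:
    """Rank models from best (1) to worst by score. Assumes no ties."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Spearman rank correlation between two score tables over the same models.

    Uses the no-ties formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    if set(a) != set(b):
        raise ValueError("both protocols must score the same set of models")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(a), ranks(b)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical accuracies of three base LLMs under two protocols.
protocol_x = {"model_a": 0.90, "model_b": 0.80, "model_c": 0.70}
protocol_y = {"model_a": 0.85, "model_b": 0.60, "model_c": 0.70}
rho = spearman_rho(protocol_x, protocol_y)
```

A value near 1 means the two protocols order the base LLMs almost identically, which is the pattern the finding describes.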

Abstract

The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation…
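
The "evaluation accuracy" the abstract refers to can be sketched as agreement between an LLM-evaluator's judgments and human labels. The following is a minimal sketch, assuming a pairwise setup in which each example has two candidate responses and both the evaluator and the human annotators pick the better one (encoded as 1 or 2). The function names, the encoding, and the unweighted averaging across datasets are assumptions for illustration, not the paper's released code.

```python
from typing import Mapping, Sequence, Tuple


def evaluation_accuracy(evaluator_prefs: Sequence[int],
                        human_prefs: Sequence[int]) -> float:
    """Fraction of examples where the evaluator picks the human-preferred response."""
    if len(evaluator_prefs) != len(human_prefs):
        raise ValueError("evaluator and human labels must be the same length")
    if not human_prefs:
        raise ValueError("need at least one labeled example")
    correct = sum(e == h for e, h in zip(evaluator_prefs, human_prefs))
    return correct / len(human_prefs)


def mean_accuracy(per_dataset: Mapping[str, Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Unweighted mean of per-dataset accuracies (an assumed aggregation choice)."""
    if not per_dataset:
        raise ValueError("need at least one dataset")
    accuracies = [evaluation_accuracy(e, h) for e, h in per_dataset.values()]
    return sum(accuracies) / len(accuracies)


# Hypothetical labels: 1 means "response 1 is better", 2 means "response 2 is better".
results = {
    "dataset_a": ([1, 2, 1, 1], [1, 2, 2, 1]),  # 3 of 4 agree
    "dataset_b": ([2, 2], [2, 1]),              # 1 of 2 agree
}
overall = mean_accuracy(results)
```

Averaging per dataset rather than pooling all examples keeps a large dataset from dominating the score, which matters when conclusions are meant to hold across several datasets.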

Code & Models

Repositories

yale-nlp/ReIFE
PyTorch · Official

Datasets

yale-nlp/ReIFE
Dataset · 14 downloads

Videos

ReIFE: Re-evaluating Instruction-Following Evaluation

Taxonomy

Topics: Educational Tools and Methods

Methods: Balanced Selection