ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan

TL;DR
This paper conducts a comprehensive meta-evaluation of LLM-based instruction-following evaluators across multiple models, protocols, and datasets, revealing insights into their robustness and consistency.
Contribution
It introduces ReIFE, a large-scale meta-evaluation suite for assessing instruction-following evaluators, and provides systematic analysis of evaluation accuracy across various conditions.
Findings
Base LLM performance ranking is consistent across protocols.
Robust evaluation requires diverse base LLMs and multiple datasets.
Evaluation results vary significantly depending on datasets and LLM capabilities.
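One way to make the first finding concrete is to compare how two protocols rank the same base LLMs. The sketch below computes a Spearman rank correlation in plain Python, assuming no tied scores. The model names, protocol names, and accuracy values are hypothetical placeholders, not results from the paper, and the paper may quantify ranking consistency differently.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its rank (1 = highest score). Assumes no ties."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation between two score tables over the same models."""
    if a.keys() != b.keys():
        raise ValueError("both protocols must score the same set of models")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(a), ranks(b)
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical evaluation accuracies of four base LLMs under two protocols.
protocol_x = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60, "model_d": 0.50}
protocol_y = {"model_a": 0.85, "model_b": 0.72, "model_c": 0.74, "model_d": 0.60}

rho = spearman(protocol_x, protocol_y)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would indicate that the two protocols order the base LLMs almost identically, which is the pattern the first finding describes.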
Abstract
The automatic evaluation of instruction following typically involves using large language models (LLMs) to assess response quality. However, there is a lack of comprehensive evaluation of these LLM-based evaluators across two dimensions: the base LLMs and the evaluation protocols. Therefore, we present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 recently proposed evaluation protocols, on 4 human-annotated datasets, assessing the evaluation accuracy of the LLM-evaluators. Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness. Moreover, our large-scale evaluation reveals: (1) Base LLM performance ranking remains largely consistent across evaluation protocols, with less capable LLMs showing greater improvement from protocol enhancements; (2) Robust evaluation of evaluation…
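As a minimal sketch of what "evaluation accuracy" can mean here, the code below measures how often an LLM-evaluator's judgment matches the human annotation. It assumes a pairwise setup where each item is labeled 1 or 2 for the preferred output. That setup, the function name, and the toy labels are illustrative assumptions, not details taken from the abstract.

```python
from typing import Sequence


def evaluation_accuracy(predicted: Sequence[int], human: Sequence[int]) -> float:
    """Fraction of items where the evaluator's label equals the human label."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human labels must have the same length")
    if len(human) == 0:
        raise ValueError("need at least one labeled item")
    matches = sum(p == h for p, h in zip(predicted, human))
    return matches / len(human)


# Toy labels (hypothetical): 1 = first output preferred, 2 = second output preferred.
human_labels = [1, 2, 2, 1, 1]
evaluator_labels = [1, 2, 1, 1, 2]

accuracy = evaluation_accuracy(evaluator_labels, human_labels)
print(f"evaluation accuracy = {accuracy:.2f}")
```

Repeating this computation for every combination of base LLM, evaluation protocol, and dataset would produce the kind of accuracy grid that a meta-evaluation compares.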
Taxonomy
Topics: Educational Tools and Methods
Methods: Balanced Selection
