Instruction-Following Evaluation for Large Language Models

Jeffrey Zhou; Tianjian Lu; Swaroop Mishra; Siddhartha Brahma; Sujoy Basu; Yi Luan; Denny Zhou; Le Hou

arXiv:2311.07911 · cs.CL · November 15, 2023 · 27 cites

TL;DR

This paper introduces IFEval, a standardized, reproducible benchmark for evaluating large language models' ability to follow natural language instructions, addressing limitations of human and LLM-based evaluations.

Contribution

The paper presents IFEval, a new benchmark with verifiable instructions and a diverse set of prompts, enabling consistent and objective evaluation of LLM instruction-following capabilities.

Findings

01 Two widely available LLMs are evaluated on IFEval.

02 IFEval provides a reproducible and objective assessment.

03 The benchmark covers 25 types of verifiable instructions across around 500 prompts.

Abstract

One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be…
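To illustrate what makes an instruction "verifiable", the two examples quoted in the abstract can be checked with plain string operations, with no human or LLM judge involved. The following is a minimal sketch, not the authors' released code: the names `min_words`, `keyword_at_least`, and `evaluate` are hypothetical, and counting words by whitespace splitting and matching keywords case-sensitively on word boundaries are assumptions made here for illustration.

```python
import re
from typing import Callable, Dict

# Hypothetical checker type: maps a model response to pass/fail.
Checker = Callable[[str], bool]


def min_words(threshold: int) -> Checker:
    """Build a checker for 'write in more than `threshold` words'.

    Assumption: words are whitespace-separated tokens.
    """
    def check(response: str) -> bool:
        return len(response.split()) > threshold
    return check


def keyword_at_least(keyword: str, times: int) -> Checker:
    """Build a checker for 'mention `keyword` at least `times` times'.

    Assumption: case-sensitive, whole-word matches only.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")

    def check(response: str) -> bool:
        return len(pattern.findall(response)) >= times
    return check


def evaluate(response: str, checkers: Dict[str, Checker]) -> Dict[str, bool]:
    """Run every checker attached to a prompt against one response."""
    return {name: check(response) for name, check in checkers.items()}


# A prompt may carry one or more verifiable instructions.
checkers = {
    "more_than_400_words": min_words(400),
    "keyword_AI_3_times": keyword_at_least("AI", 3),
}

short_response = "AI is useful. AI is fast. AI is everywhere."
results = evaluate(short_response, checkers)
print(results)
# {'more_than_400_words': False, 'keyword_AI_3_times': True}
```

Because each check is deterministic, rerunning it on the same response always gives the same verdict, which is the property the abstract contrasts with human and LLM-based evaluation.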

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Sparse Evolutionary Training