Instruction-Following Evaluation for Large Language Models
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, Le Hou

TL;DR
This paper introduces IFEval, a standardized, reproducible benchmark for evaluating large language models' ability to follow natural language instructions, addressing limitations of human and LLM-based evaluations.
Contribution
The paper presents IFEval, a new benchmark with verifiable instructions and a diverse set of prompts, enabling consistent and objective evaluation of LLM instruction-following capabilities.
Findings
The paper reports evaluation results for two widely available LLMs on IFEval.
IFEval provides a reproducible and objective assessment.
The benchmark covers 25 types of verifiable instructions across around 500 prompts, each containing one or more such instructions.
Abstract
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be…
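To illustrate what makes an instruction "verifiable", the two examples quoted in the abstract can be checked with a few lines of deterministic code. The sketch below is only an illustration, not the authors' released implementation: the function names, whitespace-based word counting, and case-insensitive whole-word keyword matching are all assumptions made here.

```python
import re
from typing import Callable, Iterable


def check_min_words(response: str, min_words: int) -> bool:
    """Return True if the response has strictly more than `min_words` words.

    Assumption: words are counted by splitting on whitespace.
    """
    return len(response.split()) > min_words


def check_keyword_frequency(response: str, keyword: str, min_count: int) -> bool:
    """Return True if `keyword` appears at least `min_count` times.

    Assumption: matches are whole-word and case-insensitive.
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= min_count


def follows_all(response: str, checks: Iterable[Callable[[str], bool]]) -> bool:
    """Return True only if every check passes (a prompt may carry several instructions)."""
    return all(check(response) for check in checks)


if __name__ == "__main__":
    response = "AI is useful. Many AI systems exist, and AI evolves."
    checks = [
        lambda r: check_min_words(r, 400),               # "write in more than 400 words"
        lambda r: check_keyword_frequency(r, "AI", 3),   # "mention the keyword of AI at least 3 times"
    ]
    print(follows_all(response, checks))
```

Because each check is a pure function of the response text, the same response always receives the same verdict, which is the property that lets this style of evaluation avoid human raters and evaluator LLMs.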
Code & Models
- 🤗 google/gemma-3-270m · model · 83k dl · ♡ 1003
- 🤗 google/gemma-3-270m-it · model · 111k dl · ♡ 569
- 🤗 Sahabat-AI/Llama-Sahabat-AI-v2-70B-IT · model · 110 dl · ♡ 13
- 🤗 unsloth/gemma-3-270m-it · model · 24k dl · ♡ 23
- 🤗 unsloth/gemma-3-270m-it-GGUF · model · 69k dl · ♡ 158
- 🤗 litert-community/gemma-3-270m-it · model · 2.1k dl · ♡ 43
- 🤗 p-e-w/gemma-3-270m-it-heretic · model · 327 dl · ♡ 13
- 🤗 HuggingFaceH4/starchat2-15b-v0.1 · model · 92 dl · ♡ 112
- 🤗 HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 · model · 82 dl · ♡ 269
- 🤗 blockblockblock/zephyr-orpo-141b-A35b-v0.1-bpw2.25 · model · 2 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Sparse Evolutionary Training
