FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

Yimin Jing; Renren Jin; Jiahao Hu; Huishi Qiu; Xiaohua Wang; Peng Wang; Deyi Xiong

arXiv:2311.09829·cs.CL·November 17, 2023·1 cites

Open Access

TL;DR

FollowEval is a comprehensive, multi-dimensional benchmark in English and Chinese, crafted by experts, to evaluate large language models' instruction-following abilities across five key areas, revealing significant performance gaps.

Contribution

This paper introduces FollowEval, a novel benchmark with human-crafted multilingual test examples covering multiple instruction-following dimensions.

Findings

01. LLMs perform significantly worse than humans on FollowEval.
02. The benchmark covers five critical instruction-following dimensions.
03. It includes multilingual test examples in English and Chinese.

Abstract

The effective assessment of the instruction-following ability of large language models (LLMs) is of paramount importance. A model that cannot adhere to human instructions might not be able to provide reliable and helpful responses. In pursuit of this goal, various benchmarks have been constructed to evaluate the instruction-following capacity of these models. However, these benchmarks are limited to a single language and are constructed using automated approaches, which restricts their applicability and the quality of the test examples they contain. To bridge this gap, we introduce the FollowEval benchmark in this paper. This benchmark is composed of instances in both English and Chinese, and all test examples are crafted by human experts. Furthermore, the FollowEval benchmark is designed to assess LLMs across five critical dimensions of instruction following: string manipulation,…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Online Learning and Analytics