HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

Milan Gritta; Gerasimos Lampouras; Ignacio Iacobacci

arXiv:2405.09186·cs.CL·May 16, 2024

TL;DR

HumanRankEval is a new automatic evaluation method for conversational language models that uses human-annotated answer sets and measures how well model rankings align with human judgments, aiding development of better LMs.
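The idea of measuring how well a model's ranking of answers aligns with human judgments can be illustrated with a small sketch. Everything here is an assumption made for illustration: the choice of Spearman-style rank correlation, the per-question averaging, the function names, and the `human`/`model` field names are not taken from the paper, whose exact scoring procedure is not given on this page.

```python
from statistics import mean


def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_agreement(human_scores, model_scores):
    """Spearman-style agreement: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(human_scores), average_ranks(model_scores))


def evaluate(questions):
    """Average the per-question agreement over all questions (hypothetical schema)."""
    return mean(rank_agreement(q["human"], q["model"]) for q in questions)


# Hypothetical data: one question where the model orders the answers exactly
# as the human scores do, and one where it orders them in reverse.
questions = [
    {"human": [3, 1, 2], "model": [-0.5, -2.0, -1.1]},
    {"human": [3, 1, 2], "model": [-2.0, -0.5, -1.1]},
]
print(evaluate(questions))
```

A score near 1 would mean the model orders answers as humans do, a score near -1 the opposite, and a score near 0 no relationship. In practice the model-side scores would come from the LM under evaluation rather than from hand-written numbers.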

Contribution

The paper introduces HumanRankEval, a scalable, automatic evaluation framework that correlates well with human judgments and detects improvements from instruction-tuning.

Findings

1. HumanRankEval (HRE) correlates strongly with human judgments.

2. HRE effectively distinguishes between pretrained and instruction-tuned LMs.

3. HRE is sensitive to model improvements after instruction-tuning.

Abstract

Language models (LMs) as conversational assistants have recently become popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement; however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE…

Code & Models

Repositories

huawei-noah/noah-research/tree/master/NLP/HumanRankEval
PyTorch · Official

Datasets

huawei-noah/human_rank_eval
dataset · 293 downloads

Videos

HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants

Taxonomy

Topics: Speech and dialogue systems · AI in Service Interactions · Text Readability and Simplification

Methods: Sparse Evolutionary Training