HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants
Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci

TL;DR
HumanRankEval is an automatic evaluation method for conversational language models. It uses questions whose answers were authored and scored by humans, and measures how well a model's ranking of those answers aligns with the human ranking.
Contribution
It introduces HumanRankEval, a scalable, automatic evaluation framework that correlates well with human judgments and detects improvements from instruction-tuning.
Findings
HRE correlates strongly with human judgments.
HRE effectively distinguishes between pretrained and instruction-tuned LMs.
HRE is sensitive to model improvements after instruction-tuning.
Abstract
Language models (LMs) as conversational assistants recently became popular tools that help people accomplish a variety of tasks. These typically result from adapting LMs pretrained on general domain text sequences through further instruction-tuning and possibly preference optimisation methods. The evaluation of such LMs would ideally be performed using human judgement; however, this is not scalable. On the other hand, automatic evaluation featuring auxiliary LMs as judges and/or knowledge-based tasks is scalable but struggles with assessing conversational ability and adherence to instructions. To help accelerate the development of LMs as conversational assistants, we propose a novel automatic evaluation task: HumanRankEval (HRE). It consists of a large-scale, diverse and high-quality set of questions, each with several answers authored and scored by humans. To perform evaluation, HRE…
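The idea of comparing a model's ranking of answers with the human ranking can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract above is truncated before it describes the procedure, so the per-answer model score (here a made-up number per answer), the choice of Spearman rank correlation, the averaging over questions, and all names (`average_ranks`, `spearman`, `rank_agreement`, `human_scores`, `model_scores`) are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def rank_agreement(questions):
    """Mean per-question correlation between model and human scores (assumed aggregation)."""
    return mean(spearman(q["model_scores"], q["human_scores"]) for q in questions)


# Toy data, invented for illustration: one score per answer, per question.
questions = [
    {"human_scores": [12, 3, 7], "model_scores": [-1.0, -3.5, -2.0]},  # same ordering
    {"human_scores": [5, 9, 1], "model_scores": [-0.5, -2.0, -1.0]},   # partly reversed
]

print(rank_agreement(questions))
```

A higher value means the model orders the answers more like the human annotators did. A real evaluation would replace the toy `model_scores` with scores produced by the LM under test.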
Taxonomy
Topics: Speech and dialogue systems · AI in Service Interactions · Text Readability and Simplification
Methods: Sparse Evolutionary Training
