HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

Xinxi Lyu; Yizhong Wang; Hannaneh Hajishirzi; Pradeep Dasigi

arXiv:2412.15524 · cs.CL · March 26, 2025

TL;DR

This paper introduces HREF, a new human response-guided evaluation benchmark for instruction-following in language models, improving reliability and reducing bias compared to traditional LLM-based assessments.

Contribution

It develops a novel evaluation benchmark using human-written responses, demonstrating enhanced agreement with human judgments and providing a comprehensive, bias-reduced assessment framework.

Findings

01. Human-written responses improve agreement between automatic evaluation and human judges by up to 3.2%
02. HREF covers 11 task categories with 4,258 samples
03. Evaluation setup is free from contamination and emphasizes individual task performance

Abstract

Evaluating the capability of Large Language Models (LLMs) in following instructions has heavily relied on a powerful LLM as the judge, introducing unresolved biases that deviate the judgments from human judges. In this work, we reevaluate various choices for automatic evaluation on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they enhance the reliability of automatic evaluations across a wide range of tasks, resulting in up to a 3.2% improvement in agreement with human judges. We also discovered that human-written responses offer an orthogonal perspective to model-generated responses in following instructions and should be used as an additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction…

Code & Models

Repositories

allenai/href · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning

Methods: Sparse Evolutionary Training