HREF: Human Response-Guided Evaluation of Instruction Following in Language Models
Xinxi Lyu, Yizhong Wang, Hannaneh Hajishirzi, Pradeep Dasigi

TL;DR
This paper introduces HREF, a new human response-guided evaluation benchmark for instruction-following in language models, improving reliability and reducing bias compared to traditional LLM-based assessments.
Contribution
It develops a novel evaluation benchmark using human-written responses, demonstrating enhanced agreement with human judgments and providing a comprehensive, bias-reduced assessment framework.
Findings
Using human-written responses improves the agreement of automatic evaluation with human judges by up to 3.2%
HREF covers 11 task categories with 4,258 samples
Evaluation setup is free from contamination and emphasizes individual task performance
Abstract
Evaluating the capability of Large Language Models (LLMs) in following instructions has heavily relied on a powerful LLM as the judge, introducing unresolved biases that cause its judgments to deviate from those of human judges. In this work, we reevaluate various choices for automatic evaluation on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they enhance the reliability of automatic evaluations across a wide range of tasks, resulting in up to a 3.2% improvement in agreement with human judges. We also find that human-written responses offer a perspective orthogonal to that of model-generated responses in following instructions and should be used as additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction…
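The "agreement with human judges" figure in the abstract can be read as the share of pairwise comparisons in which an automatic judge picks the same winner as a human annotator. The sketch below is only an illustration of that metric under stated assumptions: the function name `agreement_rate`, the "A"/"B" label scheme, and the toy labels are all hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def agreement_rate(judge_prefs: Sequence[str], human_prefs: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where the automatic judge
    picks the same winner ("A" or "B") as the human annotator."""
    if len(judge_prefs) != len(human_prefs):
        raise ValueError("judge and human label lists must have equal length")
    if not judge_prefs:
        raise ValueError("at least one comparison is required")
    matches = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return matches / len(judge_prefs)


# Hypothetical toy labels, invented for illustration only.
human = ["A", "B", "A", "A", "B"]
judge_without_reference = ["A", "A", "A", "B", "B"]  # judge sees only the two model responses
judge_with_reference = ["A", "B", "A", "B", "B"]     # judge also sees a human-written response

baseline = agreement_rate(judge_without_reference, human)
guided = agreement_rate(judge_with_reference, human)
print(f"baseline agreement: {baseline:.2f}, human-guided agreement: {guided:.2f}")
```

On real data the two label lists would come from the same set of instruction and response pairs, so that the difference between the two rates reflects only the change in what the judge was shown.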
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Intelligent Tutoring Systems and Adaptive Learning
Methods: Sparse Evolutionary Training
