Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Rylan Schaeffer; Punit Singh Koura; Binh Tang; Ranjan Subramanian; Aaditya K Singh; Todor Mihaylov; Prajjwal Bhargava; Lovish Madaan; Niladri S. Chatterji; Vedanuj Goswami; Sergey Edunov; Dieuwke Hupkes; Sanmi Koyejo; Sharan Narang

arXiv:2502.18339·cs.CL·February 26, 2025

TL;DR

In a study of four Chat Llama 2 models, scores on most standard NLP benchmarks correlate strongly with human preference evaluations, suggesting that cheaper automated metrics can predict human judgments of conversational models without extensive human annotation.

Contribution

The paper shows that scores on most NLP benchmarks correlate strongly with human evaluations of four Chat Llama 2 models, and that benchmark scores can be used to predict human preferences across model scales.

Findings

01 Most NLP benchmarks correlate strongly with human preferences

02 Some human evaluations are negatively correlated or uncorrelated with NLP benchmarks

03 Human evaluations can be predicted from NLP benchmark scores using linear regression
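
As a rough illustration of the third finding, the sketch below fits an ordinary least-squares line that maps a benchmark score to a human-evaluation score and then predicts a held-out value. It is a minimal sketch, not the paper's procedure: the function names `fit_line` and `predict` are hypothetical, the regression uses a single predictor, and every number is made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predicted human-evaluation score for benchmark score x."""
    return slope * x + intercept


# Hypothetical data: one benchmark score and one human win rate per model.
# These values are invented for the example and are NOT from the paper.
benchmark_scores = [0.40, 0.50, 0.60, 0.70]
human_win_rates = [0.30, 0.40, 0.50, 0.60]

slope, intercept = fit_line(benchmark_scores, human_win_rates)
predicted = predict(slope, intercept, 0.80)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, predicted={predicted:.3f}")
```

With only a handful of models, a fit like this has very few data points, so any prediction from it should be treated with caution.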

Abstract

The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are…
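
To make the notion of correlation between the two evaluation strategies concrete, the sketch below computes a Pearson correlation coefficient between per-model benchmark scores and per-model human-evaluation scores. It is only an illustration under stated assumptions: the helper name `pearson` is hypothetical, Pearson is one of several possible correlation measures and is not confirmed here as the paper's choice, and all values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores (four models); NOT data from the paper.
benchmark = [0.45, 0.55, 0.62, 0.70]
human_helpfulness = [0.31, 0.42, 0.48, 0.59]  # rises with the benchmark
human_adversarial = [0.66, 0.58, 0.52, 0.47]  # falls as the benchmark rises

r_pos = pearson(benchmark, human_helpfulness)
r_neg = pearson(benchmark, human_adversarial)
print(f"helpfulness r={r_pos:.2f}, adversarial r={r_neg:.2f}")
```

A coefficient near +1 corresponds to the "strongly correlate" case, and a negative coefficient corresponds to the anticorrelated case the abstract describes.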

Taxonomy

Topics: Topic Modeling

Methods: LLaMA