Holistic Evaluation of Language Models

Percy Liang; Rishi Bommasani; Tony Lee; Dimitris Tsipras; Dilara Soylu; Michihiro Yasunaga; Yian Zhang; Deepak Narayanan; Yuhuai Wu; Ananya Kumar; Benjamin Newman; Binhang Yuan; Bobby Yan; Ce Zhang; Christian Cosgrove; Christopher D. Manning; Christopher Ré; Diana Acosta-Navas; Drew A. Hudson; Eric Zelikman; Esin Durmus; Faisal Ladhak; Frieda Rong; Hongyu Ren; Huaxiu Yao; Jue Wang; Keshav Santhanam; Laurel Orr; Lucia Zheng; Mert Yuksekgonul; Mirac Suzgun; Nathan Kim; Neel Guha; Niladri Chatterji; Omar Khattab; Peter Henderson; Qian Huang; Ryan Chi; Sang Michael Xie; Shibani Santurkar; Surya Ganguli; Tatsunori Hashimoto; Thomas Icard; Tianyi Zhang; Vishrav Chaudhary; William Wang; Xuechen Li; Yifan Mai; Yuhui Zhang; Yuta Koreeda

arXiv:2211.09110 · cs.CL · October 3, 2023 · 119 cites

TL;DR

HELM provides a comprehensive, multi-metric evaluation framework for language models, covering diverse scenarios and models to enhance transparency, identify limitations, and facilitate ongoing improvements in LM assessment.

Contribution

This work introduces a holistic, multi-metric evaluation framework for language models, covering a broad set of scenarios and models, with standardized benchmarks and transparency tools.

Findings

01

Models are now evaluated on 96% of the core scenarios, up from an average of 17.9% before HELM.

02

Identifies key strengths and weaknesses across models.

03

Surfaces 25 top-level insights about LM capabilities.

Abstract

Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are…
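The coverage figure in the abstract can be checked with simple arithmetic. The sketch below is only an illustration, assuming that "87.5% of the time" refers to the fraction of (core scenario, metric) pairs that are measured; the function and constant names are hypothetical and are not taken from the HELM codebase.

```python
# Illustrative arithmetic only; names here are hypothetical, not HELM APIs.

# The seven metric categories listed in the abstract.
METRICS = [
    "accuracy",
    "calibration",
    "robustness",
    "fairness",
    "bias",
    "toxicity",
    "efficiency",
]

NUM_CORE_SCENARIOS = 16    # from the abstract
REPORTED_COVERAGE = 0.875  # "87.5% of the time", from the abstract


def coverage_counts(num_scenarios, metrics, coverage):
    """Return (total, measured) counts of (scenario, metric) pairs.

    Assumes coverage is the fraction of all pairs that are measured.
    """
    total = num_scenarios * len(metrics)
    measured = round(total * coverage)
    return total, measured


total_pairs, measured_pairs = coverage_counts(
    NUM_CORE_SCENARIOS, METRICS, REPORTED_COVERAGE
)
print(f"{measured_pairs} of {total_pairs} (scenario, metric) pairs measured")
```

Under that assumption, 16 scenarios times 7 metrics gives 112 possible pairs, of which 98 are measured and 14 are not.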

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques