Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré

TL;DR
HELM provides a comprehensive, multi-metric evaluation framework for language models, covering diverse scenarios and models to enhance transparency, identify limitations, and facilitate ongoing improvements in LM assessment.
Contribution
This work introduces a holistic, multi-metric evaluation framework for language models, covering a broad set of scenarios and models, with standardized benchmarks and transparency tools.
Findings
Models are evaluated on 96% of core scenarios, up from an average of 17.9% before HELM.
Identification of key strengths and weaknesses across models.
Surfaces 25 top-level insights about LM capabilities.
Abstract
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are…
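The multi-metric coverage figure in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: the metric names and the 16 and 87.5% figures come from the abstract, while the function name `measured_pairs` is hypothetical and the resulting pair count is derived here rather than quoted from the abstract.

```python
# Metrics listed in the abstract.
METRICS = [
    "accuracy", "calibration", "robustness", "fairness",
    "bias", "toxicity", "efficiency",
]
NUM_CORE_SCENARIOS = 16   # from the abstract
COVERAGE = 0.875          # "when possible (87.5% of the time)"


def measured_pairs(num_scenarios, num_metrics, coverage):
    """Return (total, measured) scenario-metric pairs.

    Hypothetical helper: 'total' is every scenario-metric combination,
    'measured' is the share of those implied by the coverage fraction.
    """
    total = num_scenarios * num_metrics
    return total, round(total * coverage)


total, measured = measured_pairs(NUM_CORE_SCENARIOS, len(METRICS), COVERAGE)
print(f"{measured} of {total} scenario-metric pairs measured")
```

Under these assumptions, 16 scenarios times 7 metrics gives 112 possible pairs, and 87.5% of them corresponds to 98 measured pairs.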
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
