Evaluating Large Language Models: A Comprehensive Survey

Zishan Guo; Renren Jin; Chuang Liu; Yufei Huang; Dan Shi; Supryadi; Linhao Yu; Yan Liu; Jiaxuan Li; Bojian Xiong; Deyi Xiong

arXiv:2310.19736 · cs.CL · November 28, 2023 · 61 cites

TL;DR

This survey provides a comprehensive overview of methods, benchmarks, and evaluations for large language models, covering the evaluation of their capabilities, alignment, and safety, and emphasizing responsible development to maximize societal benefits.

Contribution

It categorizes and reviews evaluation methodologies for LLMs across capabilities, alignment, and safety, and discusses building comprehensive evaluation platforms.

Findings

01. Reviewed diverse evaluation benchmarks and methodologies.
02. Highlighted the importance of safety and alignment assessments.
03. Provided a curated list of related evaluation research.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks. They have attracted significant attention and been deployed in numerous downstream applications. Nevertheless, akin to a double-edged sword, LLMs also present potential risks. They could suffer from private data leaks or yield inappropriate, harmful, or misleading content. Additionally, the rapid progress of LLMs raises concerns about the potential emergence of superintelligent systems without adequate safeguards. To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation of LLMs. This survey endeavors to offer a panoramic perspective on the evaluation of LLMs. We categorize the evaluation of LLMs into three major groups: knowledge and capability evaluation, alignment…

Code & Models

Repositories

tjunlp-lab/awesome-llms-evaluation-papers (Official; PyTorch)

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques