AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Wanjun Zhong; Ruixiang Cui; Yiduo Guo; Yaobo Liang; Shuai Lu; Yanlin Wang; Amin Saied; Weizhu Chen; Nan Duan

arXiv:2304.06364 · cs.CL · September 19, 2023 · 61 cites

TL;DR

AGIEval is a benchmark that evaluates foundation models on human-centric standardized exams. GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions, while tasks that require complex reasoning remain a weakness.

Contribution

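The paper presents AGIEval, a new benchmark built from human standardized exams (college entrance exams, law school admission tests, math competitions, and lawyer qualification tests) that assesses foundation models on human-level tasks rather than on artificial datasets.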

Findings

01. GPT-4 attains 95% accuracy on the SAT Math test and 92.5% accuracy on the English test of the Chinese national college entrance exam.

02. GPT-4 surpasses average human performance on the SAT, LSAT, and math competitions.

03. Models show strengths in understanding and knowledge but struggle with complex reasoning.

Abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the…
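The accuracy figures quoted above are, in the usual sense, the fraction of exam questions answered correctly. A minimal sketch of that scoring is given below, assuming multiple-choice questions with a single lettered answer each. The function name `accuracy` and the toy answer key are hypothetical illustrations; they are not taken from the AGIEval code or data.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Return the fraction of questions where the predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty exam")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Hypothetical toy data for illustration only (not from AGIEval).
answer_key = ["A", "C", "B", "D"]
model_output = ["A", "C", "D", "D"]

score = accuracy(model_output, answer_key)
print(f"accuracy = {score:.2%}")  # prints "accuracy = 75.00%"
```

Comparing a model with average human performance then amounts to comparing this number with the reported human accuracy on the same exam.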

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Test · Label Smoothing · Position-Wise Feed-Forward Layer · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections