AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models

Fei Tang; Wanling Gao; Luzhou Peng; Jianfeng Zhan

arXiv:2309.06495 · cs.CL · September 14, 2023

TL;DR

AGIBench is a benchmarking methodology for large language models that evaluates question-solving abilities at multiple granularities, over multimodal inputs, against human references, with automatic scoring and diverse metrics.

Contribution

It introduces a novel multi-granularity, multimodal, human-referenced, auto-scoring benchmark for LLMs, addressing diverse ability branches and input modalities.

Findings

01. Supports multi-granularity evaluation levels

02. Incorporates multimodal inputs including text and images

03. Provides auto-scoring and diverse performance metrics

Abstract

Large language models (LLMs) like ChatGPT have shown remarkable intelligence. Evaluating the question-solving abilities of LLMs and their degrees of intelligence is an active but challenging problem. First, question-solving abilities are interlaced with different ability branches, such as understanding, and with a large number of knowledge categories, such as mathematics. Second, question inputs are multimodal and may involve both text and images. Third, the response formats of LLMs are diverse, which poses great challenges for result extraction and evaluation. In this paper, we propose AGIBench -- a multi-granularity, multimodal, human-referenced, and auto-scoring benchmarking methodology for LLMs. Instead of a collection of blended questions, AGIBench focuses on three typical ability branches and adopts a four-tuple <ability branch, knowledge, difficulty, modal> to label the attributes of each…
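To make the four-tuple labeling concrete, the following is a minimal Python sketch of how a question could carry an <ability branch, knowledge, difficulty, modal> label and how accuracy could then be reported at different granularities. It is an illustration only, not the authors' implementation: the class `QuestionLabel`, the function `accuracy_by`, the integer difficulty scale, the modal strings, the "physics" category, and the sample results are all hypothetical. Only the four field names, the "understanding" branch, and the "mathematics" category come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionLabel:
    """Hypothetical container for the four-tuple named in the abstract."""
    ability_branch: str  # e.g. "understanding" (mentioned in the abstract)
    knowledge: str       # e.g. "mathematics" (mentioned in the abstract)
    difficulty: int      # assumed integer scale; the real scale is not given here
    modal: str           # assumed values such as "text" or "text+image"


def accuracy_by(results, *fields):
    """Group (label, is_correct) pairs by the chosen label fields
    and return the fraction of correct answers per group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, count]
    for label, is_correct in results:
        key = tuple(getattr(label, f) for f in fields)
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: correct / count for key, (correct, count) in totals.items()}


# Made-up results, for illustration only.
results = [
    (QuestionLabel("understanding", "mathematics", 1, "text"), True),
    (QuestionLabel("understanding", "mathematics", 2, "text"), False),
    (QuestionLabel("understanding", "mathematics", 2, "text+image"), True),
    (QuestionLabel("understanding", "physics", 1, "text"), True),
]

# Coarse granularity: one score per ability branch.
per_branch = accuracy_by(results, "ability_branch")
# Finer granularity: one score per (ability branch, knowledge) pair.
per_knowledge = accuracy_by(results, "ability_branch", "knowledge")
```

Passing more field names to `accuracy_by` yields a finer breakdown from the same labeled results, which is one plausible reading of "multi-granularity" evaluation; the paper's actual metrics and aggregation may differ.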

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification