SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li; Rui Wang; Guangzhi Wang; Yuying Ge; Yixiao Ge; Ying Shan

arXiv:2307.16125 · cs.CL · August 3, 2023 · 52 cites

TL;DR

SEED-Bench is a benchmark of 19K multiple-choice questions that evaluates how well multimodal large language models comprehend images and videos across 12 dimensions.

Contribution

The paper introduces SEED-Bench, a large-scale, multi-dimensional benchmark for evaluating generative comprehension in multimodal LLMs, together with a question-generation pipeline that combines automatic filtering and manual verification.

Findings

01. Evaluated 18 models across 12 dimensions, revealing current limitations.

02. The benchmark enables objective, automated assessment without human intervention.

03. Provides a platform for ongoing community evaluation and research.

Abstract

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation. In this work, we address the evaluation of generative comprehension in MLLMs as a preliminary step towards a comprehensive assessment of generative models, by introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple-choice questions with accurate human annotations (6× larger than existing benchmarks), which span 12 evaluation dimensions including the comprehension of both the image and video modality. We develop an advanced pipeline for generating multiple-choice questions that target specific evaluation dimensions, integrating both automatic filtering and manual verification processes. Multiple-choice questions with groundtruth options…
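
Because every question has a ground-truth option, scoring can in principle be automated as a per-dimension accuracy. The sketch below is a minimal illustration of that idea, not the authors' evaluation code: the record fields (`dimension`, `answer`, `prediction`) and the dimension labels are hypothetical placeholders, and how a model's choice is actually extracted is outside its scope.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'dimension'  - label of the evaluation dimension
      'answer'     - ground-truth option, e.g. 'A'
      'prediction' - option chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        if record["prediction"] == record["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy data with placeholder dimension labels (not taken from the benchmark).
records = [
    {"dimension": "dim_1", "answer": "A", "prediction": "A"},
    {"dimension": "dim_1", "answer": "C", "prediction": "B"},
    {"dimension": "dim_2", "answer": "D", "prediction": "D"},
]
scores = accuracy_by_dimension(records)
print(scores)
```

Reporting accuracy per dimension rather than as a single number keeps weaknesses in one capability from being hidden by strengths in another.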

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Discriminative Fine-Tuning · Dropout · Linear Warmup With Cosine Annealing · Adam · Attention Dropout · Byte Pair Encoding