SEED-Bench-2: Benchmarking Multimodal Large Language Models

Bohao Li; Yuying Ge; Yixiao Ge; Guangzhi Wang; Rui Wang; Ruimao Zhang; Ying Shan

arXiv:2311.17092 · cs.CV · November 30, 2023 · 6 cites

TL;DR

SEED-Bench-2 introduces a comprehensive hierarchical benchmark with 24,000 questions to evaluate the capabilities of multimodal large language models across multiple dimensions, highlighting current limitations and guiding future research.

Contribution

This work presents SEED-Bench-2, a novel benchmark that assesses the hierarchical multimodal capabilities of MLLMs using a large, multi-dimensional, multiple-choice dataset with objective evaluation methods.

Findings

01 Evaluated 23 open-source MLLMs, revealing their limitations.

02 Benchmark covers 27 dimensions, including text and image generation.

03 Objective assessment method eliminates the need for human or GPT intervention.
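
Because the benchmark is multiple-choice, a model's selected option can be compared directly against the ground-truth option, which is what makes scoring objective. The following is a minimal sketch of that idea, not the authors' evaluation code: the record layout (`dimension`, `answer`, `prediction`), the function name, and the dimension labels `dim_a` and `dim_b` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def score_multiple_choice(records):
    """Return per-dimension accuracy for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      'dimension'  - name of the evaluation dimension
      'answer'     - ground-truth option letter, e.g. 'A'
      'prediction' - option letter selected by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        dim = record["dimension"]
        total[dim] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up records, for illustration only.
records = [
    {"dimension": "dim_a", "answer": "A", "prediction": "A"},
    {"dimension": "dim_a", "answer": "C", "prediction": "B"},
    {"dimension": "dim_b", "answer": "D", "prediction": "d"},
]

scores = score_multiple_choice(records)
print(scores)  # {'dim_a': 0.5, 'dim_b': 1.0}
```

No human rater or judge model appears anywhere in this loop; the score depends only on whether the selected option matches the labelled one.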

Abstract

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from $L_{0}$ to $L_{4}$ based on the modalities they can accept and generate, and propose SEED-Bench-2, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Attention Dropout · Adam · Cosine Annealing · Weight Decay · Residual Connection