SEED-Bench-2: Benchmarking Multimodal Large Language Models
Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

TL;DR
SEED-Bench-2 introduces a comprehensive hierarchical benchmark with 24,000 questions to evaluate the capabilities of multimodal large language models across multiple dimensions, highlighting current limitations and guiding future research.
Contribution
This work presents SEED-Bench-2, a novel benchmark that assesses the hierarchical multimodal capabilities of MLLMs using a large, multi-dimensional, multiple-choice dataset with objective evaluation methods.
Findings
Evaluates 23 open-source MLLMs, revealing their limitations.
Benchmark covers 27 dimensions including text and image generation.
Objective assessment method eliminates the need for human or GPT intervention.
Abstract
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However, existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs, failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from L0 to L4 based on the modalities they can accept and generate, and propose SEED-Bench-2, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs. Specifically, SEED-Bench-2…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Attention Dropout · Adam · Cosine Annealing · Weight Decay · Residual Connection
