SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

Bohao Li; Yuying Ge; Yi Chen; Yixiao Ge; Ruimao Zhang; Ying Shan

arXiv:2404.16790 · cs.CV · April 26, 2024 · 2 cites

TL;DR

SEED-Bench-2-Plus is a new benchmark of 2.3K multiple-choice questions that evaluates how well multimodal large language models understand text-rich images such as charts, maps, and web pages. Its results highlight the current limitations of these models.

Contribution

This work introduces SEED-Bench-2-Plus, a benchmark specifically targeting text-rich visual comprehension in MLLMs, with human-annotated questions covering diverse real-world scenarios across charts, maps, and webs.

Findings

01. Current MLLMs show limitations in text-rich visual comprehension.

02. The benchmark reveals gaps in models like GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus.

03. The benchmark provides a new standard for evaluating text-rich visual understanding.

Abstract

Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios, which are characterized by extensive text embedded within images, are ubiquitous in the real world. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from them. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide…
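Because the benchmark consists of multiple-choice questions grouped into categories (Charts, Maps, Webs), a model's results can be summarized as per-category and overall accuracy. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code: the record keys `category`, `answer`, and `prediction`, and the function name `accuracy_by_category`, are assumptions made for illustration and do not reflect the benchmark's actual data format.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute per-category and overall multiple-choice accuracy.

    `records` is an iterable of dicts with hypothetical keys:
      'category'   - e.g. "Charts", "Maps", or "Webs"
      'answer'     - ground-truth option letter, e.g. "A"
      'prediction' - option letter chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[cat] += 1

    per_category = {cat: correct[cat] / total[cat] for cat in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return per_category, overall


if __name__ == "__main__":
    # Toy records for illustration only; these are not benchmark data.
    toy = [
        {"category": "Charts", "answer": "A", "prediction": "a"},
        {"category": "Charts", "answer": "B", "prediction": "C"},
        {"category": "Maps", "answer": "D", "prediction": "D "},
    ]
    per_category, overall = accuracy_by_category(toy)
    print(per_category, overall)
```

Reporting accuracy per category as well as overall keeps a strong score in one category from masking a weak score in another.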

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Focus