SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

TL;DR
SEED-Bench-2-Plus is a benchmark of 2.3K multiple-choice questions that evaluates how well multimodal large language models understand text-rich images such as charts, maps, and web pages, and it highlights the limitations of current models on such content.
Contribution
This work introduces SEED-Bench-2-Plus, the first comprehensive benchmark specifically targeting text-rich visual comprehension in MLLMs, covering diverse real-world scenarios.
Findings
Current MLLMs show limitations in text-rich visual comprehension.
The benchmark reveals gaps in models like GPT-4V, Gemini-Pro-Vision, and Claude-3-Opus.
The benchmark provides a new standard for evaluating text-rich visual understanding.
Abstract
Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide…
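As an illustration of how a multiple-choice benchmark split into categories might be scored, here is a minimal sketch. It is not the authors' evaluation code: the excerpt above does not describe the scoring protocol, so this assumes plain exact-match accuracy on the chosen option letter. The function name `accuracy_by_category`, the record layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute exact-match accuracy per category.

    `records` is an iterable of (category, predicted_choice, correct_choice)
    tuples. This layout is an assumption made for illustration only.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, answer in records:
        total[category] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if predicted.strip().upper() == answer.strip().upper():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Hypothetical toy predictions, not results from the paper.
records = [
    ("Charts", "A", "A"),
    ("Charts", "B", "C"),
    ("Maps", "d", "D"),
    ("Webs", "B", "B"),
    ("Webs", "C", "A"),
]

scores = accuracy_by_category(records)

# Overall accuracy weights every question equally, regardless of category.
overall = sum(
    predicted.strip().upper() == answer.strip().upper()
    for _, predicted, answer in records
) / len(records)
```

Reporting per-category scores alongside the overall figure keeps a weakness in one category (say, Maps) from being hidden by stronger results in the others.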
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Focus
