MMR: Evaluating Reading Ability of Large Multimodal Models

Jian Chen; Ruiyi Zhang; Yufan Zhou; Ryan Rossi; Jiuxiang Gu; Changyou Chen

arXiv:2408.14594 · cs.CV · August 28, 2024

TL;DR

This paper introduces the MMR benchmark, a comprehensive evaluation tool for assessing large multimodal models' complex reasoning and spatial understanding in text-rich images, revealing their current limitations.

Contribution

The paper presents the first human-annotated, multi-task benchmark for text-rich image understanding, highlighting the gaps in existing LMM capabilities.

Findings

01 Existing LMMs perform poorly on complex reasoning tasks.

02 The MMR benchmark exposes limitations of state-of-the-art models.

03 Current models do not fully understand spatial and contextual information.

Abstract

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. As a result, current benchmarks fail to accurately reflect the performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. Evaluating several state-of-the-art LMMs, including GPT-4o, reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.

Taxonomy

Topics: Natural Language Processing Techniques · Second Language Acquisition and Learning · Speech and dialogue systems